Methods of Selection, Reporting and Analysis of Genetic Markers Using Broad-Based Genetic Profiling Applications

ABSTRACT

Disclosed is a method for determining whether an individual has an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes and related methods of selecting a set of genetic markers; for providing relevant genetic information to an individual; of evaluating the probability that progeny of two individuals of the opposite sex will exhibit one or more phenotypic attributes; and for determining the genomic ethnicity of an individual.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of patent application Ser. No. 14/542,524, filed Nov. 14, 2014, which application is a continuation application of U.S. patent application Ser. No. 13/170,088, filed Jun. 27, 2011, which application is a divisional application of U.S. patent application Ser. No. 10/552,665, filed Oct. 11, 2005, now U.S. Pat. No. 8,417,459, which application is a National Stage filing of International Application No. PCT/US2004/10905, filed Apr. 9, 2004, which application claims the benefit of U.S. Provisional Application No. 60/461,740, filed Apr. 9, 2003 and U.S. Provisional Application No. 60/492,707, filed Aug. 4, 2003, each of which is hereby incorporated by reference in its entirety.

BACKGROUND OF INVENTION

Personalized human health care products and services that enable individuals to more actively manage their health based on their genetic profiles have been increasingly heralded following the publication of a draft human genome sequence in June 2000 (Venter, J C, Funct Integr Genomics. 2000 November; 1(3):154-5) and a substantially complete sequence of the human genome in February 2001 (Venter, J C et al., Science 291(5507):1304-51 [2001]; Lander E S et al., Nature 409(6822):860-921 [2001]). To date, however, the commercial availability of personalized genetic profile products and services has been extremely limited and costly.

The “genome” of an individual member of a species comprises that individual's complete set of genes. Particular locations within the genome of a species are referred to as “loci” or “sites”. “Alleles” are varying forms of the genomic DNA located at a given site. In the case of a site where there are two distinct alleles in a species, referred to as “A” and “B”, each individual member of the species can have one of four possible combinations: AA; AB; BA; and BB. The first allele of each pair is inherited from one parent, and the second from the other.

The “genotype” of an individual at a specific site in the individual's genome refers to the specific combination of alleles that the individual has inherited. A “genetic profile” for an individual includes information about the individual's genotype at a collection of sites in the individual's genome. As such, a genetic profile is comprised of a set of data points, where each data point is the genotype of the individual at a particular site.

Genotype combinations with identical alleles (e.g., AA and BB) at a given site are referred to as “homozygous”; genotype combinations with different alleles (e.g., AB and BA) at that site are referred to as “heterozygous.” It should be noted that, when the alleles in a genome are determined using standard techniques, AB and BA cannot be differentiated, meaning it is impossible to determine from which parent a certain allele was inherited, given solely the genomic information of the individual tested. Moreover, variant AB parents can pass either variant A or variant B to their children. While such parents may not have a predisposition to develop a disease, their children may. For example, two variant AB parents can have children who are variant AA, variant AB, or variant BB, and one of the two homozygous combinations in this set of three variant combinations may be associated with a disease. Having advance knowledge of this possibility allows potential parents to make the best possible decisions about their children's health.
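The inheritance pattern described above can be illustrated with a short sketch. It assumes a single site with two alleles, each parent passing either of its alleles with equal probability; the function name `offspring_distribution` is hypothetical and is used here only for illustration. Because AB and BA cannot be distinguished, the two are merged into one unordered genotype.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_distribution(parent1: str, parent2: str) -> dict:
    """Probability of each unordered offspring genotype at one two-allele site.

    Assumes each parent transmits either of its two alleles with equal
    probability (simple Mendelian segregation).
    """
    # Enumerate the four equally likely allele pairings; sorting the pair
    # merges AB and BA into a single unordered genotype.
    counts = Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Two heterozygous (AB) parents, as in the example in the text.
dist = offspring_distribution("AB", "AB")
for genotype in sorted(dist):
    print(genotype, dist[genotype])
```

Under these assumptions, two AB parents yield all three genotypes, with the heterozygous genotype twice as likely as either homozygous one.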

Diseases are often associated with the collection of atoms, molecules, macromolecules, cells, tissues, organs, structures, fluids, metabolic, respiratory, pulmonary, neurological, reproductive or other physiological function, reflexes, behaviors and other physical characteristics observable in the individual through various means. The “phenotype” of an individual refers to one or more of these observable physical characteristics. An individual's phenotype is driven in large part by constituent proteins in the individual's proteome, the collection of all proteins produced by the cells comprising the individual and coded for in the individual's genome.

In many cases, a given phenotype can be associated with a specific genotype. For example, an individual with a certain pair of alleles for the gene that encodes a particular lipoprotein associated with lipid transport may exhibit a phenotype characterized by a susceptibility to a hyperlipidemic disorder that leads to heart disease.

While efforts have been undertaken to create new personalized active health management products and services based on genetic profiles, several shortcomings characterize the existing art. Among these shortcomings are the following:

First, the mix of existing products and services is, in the aggregate, narrowly focused on a small set of disease phenotypes, making them inefficient in enabling health management practices that encompass a broad set of phenotypes;

Second, existing genetic testing products and services are each focused on a genetic indication for one or a small set of diseases;

Third, until the high cost of sequencing the genome of an individual human declines by several orders of magnitude, an alternative to genome sequencing technology must be used as the basis for genetic profile products and services, and currently available alternatives require substantial modification in order to be integrated into the array of technologies and logistics necessary to provide genetic profile products and services encompassing a comprehensive set of diseases;

Fourth, existing informatics and database management tools do not scale efficiently or effectively to the dynamic and exponential growth of reported scientific research and clinical findings underlying genetic profile products and services, resulting in a high degree of information obsolescence;

Fifth, existing genetic profile products and services are designed to be used at key life events, such as disease onset, family disease onset, preconception and prenatal events, and typically by one or more members of a family with an already-known history of a particular disease among its generations, rather than as part of a comprehensive personalized health management program; and

Sixth, genetic counseling practices, focused on point tests assessed at key life events, must be significantly altered to support the increase in information volume and complexity arising from broad-based genetic profiling.

The objective of personalized genetic profile health management products and services is to provide individuals with information about their predisposition to diseases. Armed with this information, individuals can, in many instances, make decisions about their dietary practices, pharmaceutical use, exercise, and other lifestyle habits that are designed to better manage their predisposition to diseases.

From individual to individual within any species, genes are characterized by a very high degree of conservation in the sequence of nucleotide base pairs comprising them. At certain locations in many sites, however, the specific nucleotides that comprise a gene can undergo alteration, or mutation. Mutations can be inherited from a parent or acquired during a person's life. A hereditary mutation will be present in all of a person's cells and will be passed on to future generations, because the person's reproductive cells (sperm or egg) will contain the mutation. An acquired mutation can arise in the DNA of individual cells as a result of many possible factors. For example, mutations in the DNA of skin cells can be caused by exposure to the sun's UV radiation. Genetic mutations in other cells can arise from errors that occur just prior to cell division, during which a cell makes a copy of its DNA before dividing into two. Genetic profile products and services tend to focus on hereditary mutations.

The situation in which two or more sequence variants of an allele exist at a site across different members of a population is called a “polymorphism,” typically defined as having an occurrence frequency of greater than 1% within that population. Several different types of polymorphisms are known in the art. By far the most common polymorphisms are those involving single nucleotide variations between individuals of the same species; such polymorphisms are called “single nucleotide polymorphisms”, or “SNPs”. To date, at least 1.42 million SNPs have been identified in the human genome. (Sachidanandam R et al., Nature 409(6822):928-33 [2001]). While it is believed that the great preponderance of these SNPs are harmless, a substantial number have been associated with various diseases.

SNPs that occur in the protein coding regions of genes that give rise to the expression of variant or defective proteins are potentially the cause of a genetic-based disease. Even SNPs that occur in non-coding regions can result in altered mRNA and/or protein expression. Examples are SNPs that cause defective splicing at exon/intron junctions. Exons are the regions in genes that contain three-nucleotide codons that are ultimately translated into the amino acids that form proteins. Introns are regions in genes that can be transcribed into pre-messenger RNA but do not code for amino acids. In the process by which genomic DNA is transcribed into messenger RNA, introns are often spliced out of pre-messenger RNA transcripts to yield messenger RNA.

For example, in the “healthy” form of the protein hemoglobin, the amino acid at the sixth position in the protein's beta chain is glutamic acid. This amino acid is encoded in the hemoglobin gene by the DNA codon guanine-adenine-guanine (GAG). In some individuals, however, the adenine nucleotide in this codon is replaced with the thymine nucleotide, resulting in a GTG codon which codes for the amino acid valine. This substitution of valine for glutamic acid alters the normal shape of the hemoglobin protein. Red blood cells that contain these abnormally shaped hemoglobin proteins exhibit a sickle shape and are unable to perform the oxygen-transport function normally associated with red blood cells. Individuals who are GTG homozygous (i.e., have inherited a GTG variant from each parent) suffer from sickle cell anemia.
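The single-nucleotide substitution described above can be expressed as a minimal sketch. The lookup table below contains only the two codons named in the text, not the full genetic code, and the names `CODON_TO_AMINO_ACID` and `substitute_base` are hypothetical.

```python
# Partial codon table: only the two DNA codons discussed in the text.
CODON_TO_AMINO_ACID = {
    "GAG": "glutamic acid",
    "GTG": "valine",
}


def substitute_base(codon: str, position: int, new_base: str) -> str:
    """Return a copy of the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"
# Replace the middle base (adenine) with thymine, as in the sickle cell variant.
variant_codon = substitute_base(normal_codon, 1, "T")

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(variant_codon, "->", CODON_TO_AMINO_ACID[variant_codon])
```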

In addition to sickle cell anemia, SNPs have been associated with diseases such as cystic fibrosis, Huntington's chorea, beta-thalassemia, muscular dystrophy, fibromuscular dysplasia, phenylketonuria, Type II diabetes, a hyperlipidemic disorder associated with Apolipoprotein E2, at least one form of hypertension, and some forms of migraine headaches. These disease-associated SNPs are inherited through classic Mendelian mechanisms. This type of SNP, however, is not believed to be the predominant form of SNPs associated with the most common diseases. This view is supported by the theory that common mutations in the protein coding regions would render protein structures entirely dysfunctional and therefore completely shut down a specific pathway or parts of such pathways, a result which is not supported by observation. Nevertheless, it is believed that functional variants associated with phenotypes further associated with diseases should be clustered around non-coding sites that play an important role in the functioning of the genome.

Examples of such functional, non-coding sites are the “splice sites” at which pre-messenger RNA transcripts are spliced into messenger RNA (mRNA). The need for splicing arises from the fact that within the pre-messenger RNA transcripts are RNA sequences that correspond to introns in the genomic DNA from which the pre-messenger RNA transcript derives. The complex of proteins and RNA at which splicing occurs is called the “spliceosome”. (See, e.g., Fairbrother et al. 2002).

A few different methods are commonly used to analyze DNA for polymorphisms and genotype. The most definitive method is to sequence the DNA to determine the actual base sequence (see, A. M. Maxam and W. Gilbert, Proc. Natl. Acad. Sci. USA 74:560 (1977); Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463 (1977)). Patent application 20020082869, “Method and system for providing and updating customized health care information based on an individual's genome”, Anderson, Glen J., describes a system for delivering personalized genetic profiling information based on sequencing. Although such a method is the most definitive, it is also the most expensive and time-consuming method. Accordingly, the sequencing of the human genome has only been performed for research purposes such as the Human Genome Project on samples from a very small number of individual humans, and at a cost of millions of dollars per individual. While the cost of sequencing the genome of an individual human has been following a steeply declining price/performance curve, where performance is measured in terms of accuracy and time, the substantial cost that still stands today prohibits its use on a broad commercial scale. Until the cost of sequencing technologies declines substantially further, the delivery of genetic profiles to a significantly large number of individuals cannot be cost effectively based on genome sequencing. Moreover, as described below, simply being able to sequence an individual's genome is not sufficient to generate and provide a comprehensive genetic profile product or service to the individual.

Another method of analyzing DNA for polymorphisms and genotype is restriction mapping analysis. With this method genomic DNA is digested with a restriction enzyme and the resulting fragments are analyzed on an electrophoresis gel or with a Southern blot to determine the presence or absence of a polymorphism that changes the recognition site for the restriction enzyme. This method can also be used to determine the presence or absence of gross insertions or deletions in genomic DNA by observing the lengths of the resulting DNA fragments. In this respect, restriction mapping analysis has limited use in the type of genome-wide search for polymorphisms and genotyping analysis required for providing genetic profile products and services of the type contemplated by the present invention.

Another method of determining the genotype of an individual at a given site is to detect the presence of one or more nucleotide sequences at that site known to be associated with a predisposition, disease or other phenotypic abnormality. These sites, also called “genetic markers,” can be detected using various tagged oligonucleotide hybridization technologies that are significantly less costly than genomic sequencing and allele-specific hybridization. Means now exist for constructing and performing large-scale, multiplexed genetic marker hybridization tests on biological samples from individuals, such as samples of blood, saliva and urine. These means, such as very dense chip and bead arrays, can enable a sample from an individual to be tested simultaneously for the presence of thousands of genetic markers. (Oliphant A et al., Biotechniques Suppl:56-8, 60-1 [2002]; and Fodor S P, Science 251(4995):767-73 [1991]).

Splice junctions in pre-messenger RNA, 5-prime (exon to intron transition) and 3-prime (intron to exon transition), are the sequence regions that are used as recognition sites for the spliceosome and contain certain sequence motifs which typically are conserved between related species. Nucleotide changes in these binding sites can have a substantial effect on the spliced mRNA product, depending on the effect of the mutation on the overall binding affinity of the spliceosome components with the mRNA sequence. Known mis-splicing behavior arises from exon skipping, alternative splicing, protein coding truncation through the introduction of a frame shift, and the disruption of the entire mRNA production process. These changes have significant effects in the mRNA and protein processing step and can totally change their production. In addition, smaller changes can partially regulate and influence quantitatively the splicing behavior of certain genes. Additional sites known to be involved in, and sometimes even to regulate, splicing are the branch-point, enhancer and silencer sequences (Fairbrother et al. 2002). Splice sites constitute locations in the genome for evolutionary pressure to function through nucleotide mutations.

Similarly, promoter regions in genes constitute locations in the genome where the presence of a SNP can be used for determining an individual's genotype. As gene-expression regulatory mechanisms, promoter regions include the transcription start site and various transcription factor-binding sites, including all the regions that are involved in gene regulation.

The determination of the presence of polymorphisms or, less frequently, mutations, in DNA has become a very important tool for a variety of purposes. Detecting mutations that are known to cause or to predispose persons to disease is one of the more important uses of determining the possible presence of a mutation. One example is the analysis of the gene named BRCA1 that may result in breast cancer if it is mutated (see, Miki et al., Science, 266:66-71, 1994). Several known mutations in the BRCA1 gene have been causally linked with breast cancer. It is now possible to screen women for these known mutations to determine whether they are predisposed to develop breast cancer. Some other uses for determining polymorphisms or mutations are for genotyping and for mutational analysis for positional cloning experiments.

In some cases, as illustrated in the case of the hemoglobin SNP and sickle cell anemia, the association of a SNP with a disease is direct and well-established and can be simply diagnosed. In many other cases, however, the association of a SNP with a phenotype that gives rise to disease or other adverse medical condition is not well-established, and different diseases and disorders have different associations with different genotypes and SNPs. In these cases, the association between genotype and phenotype can vary from individual to individual in a complex manner that depends on the individual's genome, age, family history, lifestyle habits, and other personal health and demographic factors. Consequently, direct testing of the individual's DNA is not accurate 100% of the time in predicting the onset of a genetically-based disease or other adverse medical condition. In these more complex cases, there is a probabilistic relationship between genotype (as characterized by different variants and SNPs) and phenotype (as characterized by the association of phenotypes with different diseases). In these cases, the presence of a SNP at a given genetic site is not sufficient by itself for the development of a pathological condition. In addition, not all persons possessing a given SNP in a given variant will develop a disease associated with that SNP. The onset of a genetically-based disease may also depend on exposure to certain conditions in a person's environment. Moreover, the same disease, disorder, or other adverse medical condition associated with a given SNP in a given variant may result from a different SNP at another site. Consequently, comprehensive analysis of the relationship between an individual's genotype and phenotype requires a scoring matrix of variables along various dimensions and a method of using this matrix to determine the probability that a given genotype in a given individual will result in a given phenotype.

To further illustrate the complexity of associating genotypes with phenotypes, it is currently believed that the human genome comprises approximately 30,000 genes while the human proteome comprises potentially millions of proteins. The process by which the information contained in the DNA comprising 30,000 distinct genes is transcribed into messenger RNA, which is in turn translated into the sequence of amino acids comprising potentially millions of distinct proteins, therefore adds significant complexity to associations of genotypes with phenotypes. In addition, in the search for unknown disease-causing variants, whole-genome association scans using hundreds of thousands of genetic markers simultaneously are likely to face serious theoretical-statistical challenges, as well as practical difficulties associated with the management of data sets of enormous size and complexity. One obvious problem is the fact that, the more genetic markers are used, the higher the expected number of apparent, spurious associations that are the result of statistical chance as opposed to true association stemming from shared genealogy between genetic marker and causative allele.
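The growth of spurious associations with the number of markers can be made concrete with a small arithmetic sketch. It assumes that every marker is tested independently at the same significance level and that none is truly associated; the marker count and significance level below are illustrative values, not figures from this disclosure. The Bonferroni threshold is shown only as one commonly used correction and is not a technique named in the text.

```python
def expected_false_positives(num_markers: int, alpha: float) -> float:
    """Expected number of chance associations when no marker is truly associated."""
    return num_markers * alpha


def bonferroni_threshold(num_markers: int, alpha: float) -> float:
    """Per-marker significance level that keeps the family-wise error rate near alpha."""
    return alpha / num_markers


# Illustrative values (assumptions): 500,000 markers tested at the 0.05 level.
num_markers = 500_000
alpha = 0.05

spurious = expected_false_positives(num_markers, alpha)
per_marker_alpha = bonferroni_threshold(num_markers, alpha)
print(f"expected spurious associations: {spurious:.0f}")
print(f"Bonferroni per-marker threshold: {per_marker_alpha:.1e}")
```

The expected count scales linearly with the number of markers, which is why a scan over hundreds of thousands of markers needs a far stricter per-marker threshold than a single test.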

Beyond this complexity of associating genotypes with phenotypes, there has been rapid growth of data on the existence of SNPs, their locations in the human genome, and associations of SNPs with phenotypes that are further associated with various diseases. This data arises from research on genomics, proteomics, preclinical and clinical studies of pharmaceuticals and related research gathered from laboratories, hospitals and medical clinics around the world.

Genomics has the potential to change the way medicine is practiced and impact the health of individuals. (E.g., Guttmacher A E, Collins F S, Genomic medicine—a primer, N Engl J Med 347: 1512-20 and Collins [2002]; Varmus, Getting ready for gene-based medicine, N Engl J Med 347: 1526-7 [2002]). Through sequencing and genotyping, extensive personal genetic information is expected to continue to be generated in large quantities in coming years. (Trager R S, DNA sequencing. Venter's next goal: 1000 human genomes, Science 298: 947 [2002]). The rapid growth in volume of available genomic and proteomic data has been characterized by substantial disorder and information obsolescence. For example, currently, the Gene Ontology (GO) consortium (Ashburner, M et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-9 [2000]) and the National Library of Medicine's MeSH (Schulman J-L, What's New for 2001 MeSH, NLM Tech Bull. 317 [2001]) are two of the best-known ontologies in the bioinformatics domain. Neither of these ontologies, however, currently contains the necessary information to support research about the relationships between genes and disease in the context of the human genome. While GO is well suited to classify a gene product in terms of its function, process and location, it has no terms to describe human diseases and the relations between them, whereas MeSH, while rich in descriptions and classifications of human disease, contains no information about sequences, little information about genes, and no information about disease causing mutations and SNPs. This is an unfortunate situation, especially in light of the recent completion of the human genome sequence, and its annotation.

A key aspect of research in genetics is the association of sequence variation with disease genes and phenotypes. Sequence variation data are currently available, for example, from OMIM and HGMD (Hamosh A, et al., Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders, Nucleic Acids Res 30: 52-5 [2002]; Krawczak, M et al., Human gene mutation database—a biomedical information and research resource, Hum Mutat 15: 45-51 [2000]; McKusick V A, Online Mendelian Inheritance in Man, OMIM (TM), McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins [2000]), among others, both of which provide phenotypic information and describe amino acid variation. Unfortunately, in most cases these variation references do not provide sufficient information to support their direct mapping onto current genomic sequences and the associated annotated genes. Single nucleotide polymorphism (SNP) data are held in dbSNP and other publicly accessible databases. (E.g., Sachidanandam, R et al., A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms, Nature 409: 928-33 [2001]; Sherry, S T et al., dbSNP: the NCBI database of genetic variation, Nucleic Acids Res 29: 308-11 [2001]). While these databases contain millions of entries each including the position of the SNP on the genome, they do not provide significant phenotypic information about the SNPs at the levels which need to be reached, namely from the genome to the phenotype and the clinic.

Moreover, as the volume of genomic and proteomic data grows there are requirements to synthesize vast amounts of information to enable clearer understanding of an individual's genetic profile. Patent application 20020052761 (“Method and system for genetic screening data collection, analysis, report generation and access”, Fey, Christopher T.; et al.) describes a system for generating highly complex personal health reports to individuals concerning their genetic test results, based on an aggregate set of genetic markers and phenotypes.

Over the years most genomic and clinical advances have been published in scientific journals. Molecular biology advances and findings have been published predominantly in molecular biology journals (e.g., Cell, Nat. Genetics, Am J. Hum. Genetics, etc.), and clinical phenotype related findings have been published predominantly in medical journals (e.g., N Eng J Medicine, Lancet, etc.). Because these different journals are directed to different communities, large communication gaps have been created. Thus, there now exist in the public domain two distinct information resources, and neither is as valuable as it potentially could be because current research efforts require their integration. One part includes all large public genomic databases and the other is the vast amount of clinical research data, mostly held in publications, but increasingly accessible electronically. There is a clear tendency in the community toward computer-based classification of disease through ontologies and toward relating medical diagnostic classification schemes such as the ICD-9 with gene diseases (e.g., the NuGene project; Chisholm, R et al., The Nugene Project [2003]).

There are currently various standardization efforts occurring within molecular biology, most notably the gene ontology (GO) consortium efforts. (Ashburner, M et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-9 [2000]). Additional ontologies such as the sequence ontology, the mutation ontology and others are works in progress (for an overview of ontology development, see the “global open biological ontologies” (GOBO) web site, www.geneontology.org/doc/gobo.html).

Databases containing information on polymorphisms are also expected to have an important impact in the field of pharmacogenomics. Pharmacogenomics is an area of research focused on how variations in a patient's DNA can cause the patient to respond differently to pharmaceuticals. The importance of understanding these variations is underscored by the number of hospitalizations and deaths that occur each year that are caused by adverse drug reactions. One method of characterizing the genetic basis of drug response is by cataloging variations in drug response as a function of SNPs. The more SNPs cataloged, the more robust and effective the database. However, collecting and sorting the SNPs becomes a huge undertaking. In U.S. Patent application 20020049772, Reinhoff et al. provide a broad overview of polymorphisms, pharmacogenomics, and classifying populations based upon sets of polymorphisms.

In addition to the scientific, technological and medical complexities that characterize the development and commercialization of genetic profile products and services, there are growing legal and regulatory complexities. For example, patient privacy has been a growing concern in multiple jurisdictions. In Europe, the European Union Directive 95/46/EC is designed to protect individuals with regard to the processing and movement of their personal data. In the United States, under the Health Insurance Portability and Accountability Act of 1996, commonly referred to as “HIPAA”, regulations have been adopted that set forth “Standards for Privacy of Individually Identifiable Health Information”. The purpose of these regulations is to help guarantee privacy and confidentiality of patient medical records. These Standards are quite extensive and apply to health care providers, insurers, payers and employers.

The confluence of all of the factors discussed above leads to the conclusion that what has been lacking from the art, but necessary for viable broad-based commercial provision of personalized health management products and services based on genetic profiling, is a method that satisfies the following requisites:

(1) the genotype of the individual to whom such products and services are being provided must be accurately and economically determinable at a large number of sites in that individual's genome relevant to a broad selection of different diseases;

(2) a large, dynamic, well-curated database containing the associations between diverse genotypes and phenotypes must be maintained, easily accessed, and updated at very frequent intervals;

(3) for each individual to whom such products and services are being provided, the individual's genotype at each such site must be easily analyzed and filtered through such database to determine the individual's phenotype and construct the individual's genetic profile; and

(4) the genetic profile so constructed for each individual and its implications must be easily communicated to the individual and the individual's physicians and medical/health care counselors in an effective manner that complies with health care, privacy and other laws and regulations.

Various means exist for practicing each of these separately. Each such means, however, suffers from various deficiencies, and a method of collectively optimizing their combined practice is required in order to provide health care management products and services on a broad commercial scale at prices that are economically attractive to both provider and customer. The present invention provides these and other benefits.

SUMMARY OF INVENTION

The present invention relates to a method for determining an individual's probability, whether enhanced, diminished, or average probability, of exhibiting one or more phenotypic attributes through evaluating genomic markers from that individual for zygosity for the members of a preselected set of genetic markers. In accordance with the method, the markers are compared to a multivariate scoring matrix to obtain a marker score, from which it is determined whether an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes is indicated. The multivariate scoring matrix correlates patterns of marker zygosity with probabilities of exhibiting phenotypic attributes.
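One way to picture the comparison of marker zygosity against a scoring matrix is the following minimal sketch. It is an illustration under stated assumptions rather than the scoring scheme of the invention: the marker names, the per-genotype contributions, the additive combination, and the classification thresholds are all hypothetical.

```python
# Hypothetical scoring matrix: marker -> genotype -> score contribution.
# All names and values are assumptions made for illustration.
SCORING_MATRIX = {
    "marker_1": {"AA": 0.0, "AB": 0.4, "BB": 1.1},
    "marker_2": {"AA": -0.3, "AB": 0.0, "BB": 0.2},
    "marker_3": {"AA": 0.0, "AB": 0.0, "BB": 0.9},
}


def normalize(genotype: str) -> str:
    """Treat AB and BA as the same genotype, since they cannot be told apart."""
    return "".join(sorted(genotype))


def marker_score(profile: dict) -> float:
    """Sum the matrix contributions for each marker genotype in the profile."""
    return sum(SCORING_MATRIX[m][normalize(g)] for m, g in profile.items())


def classify(score: float, low: float = -0.25, high: float = 0.5) -> str:
    """Map a marker score to a probability category using assumed thresholds."""
    if score > high:
        return "enhanced"
    if score < low:
        return "diminished"
    return "average"


profile = {"marker_1": "BA", "marker_2": "AA", "marker_3": "BB"}
category = classify(marker_score(profile))
print(category)
```

A simple additive score is used here for readability; a matrix that correlates patterns of zygosity across markers could also encode interactions between markers, which this sketch does not attempt.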

The present invention is further directed to a method of selecting a set of genetic markers. The method involves filtering markers for inclusion in the set, based on a determination of measures of phenotypic value and/or prioritization, such as but not limited to, penetrance of the marker in a population or subpopulation of interest; the degree of linkage of the marker to a particular phenotype; the relative contribution of the marker to communicating the phenotype; and the degree of statistical or scientific confidence to be placed in any data associated with any of the measures of phenotypic value or priority used.

The present invention further relates to a method for providing relevant genetic information to an individual, or about an individual to another interested party (e.g., a physician, veterinarian, researcher, breeder, or owner). The method involves identifying genotypic characteristics of the individual that correlate with a relative probability of exhibiting one or more phenotypic characteristics. The method also involves determining for each of the one or more phenotypic characteristics whether the individual has an enhanced, diminished, or average probability of exhibiting the characteristic by evaluating genomic markers for zygosity (for example, but not limited to heterozygosity or homozygosity) at each member of a preselected set of markers and comparing the zygosity of the markers to a multivariate matrix that correlates patterns of marker zygosity with probabilities of exhibiting phenotypic attributes, and determining whether the marker score resulting from this comparison indicates an enhanced, diminished, or average probability of exhibiting the one or more phenotypic attributes. Then one or more selection criteria are applied for each of the one or more phenotypic characteristics. Each selection criterion applied imposes a total, a partial, or no limitation on the information communicated to the individual. Subsequently, information is identified that is relevant to the individual's probabilities of exhibiting the one or more phenotypic characteristics and is consistent with the limitations imposed by the selection criteria, and the information is communicated to the individual.

The present invention is further directed to a method of evaluating the genetic profile combination (i.e., “compatibility”) of two individuals of the opposite sex (i.e., male and female) of the same species, whether human or a non-human species, in other words, a method of evaluating the probability that progeny of two individuals of the opposite sex will exhibit one or more phenotypic attributes. The method involves evaluating genomic markers from each of the two individuals for zygosity (for example, but not limited to marker heterozygosity or homozygosity) at each member of a preselected set of markers, determining a probability distribution for the zygosity for each member of the preselected set of markers in the genomes of the progeny of the two individuals, and comparing the probability distributions to a multivariate matrix to obtain a probability distribution score, wherein the matrix correlates patterns of marker zygosity with probabilities of exhibiting phenotypic attributes, and determining whether the probability distribution score indicates that the progeny of the two individuals have an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes.

The present invention further provides a method for determining the genomic ethnicity of an individual comprising evaluating genomic markers from an individual at each member of a preselected set of genetic markers, comparing the genotype for each of the markers to a multivariate matrix, wherein the matrix correlates patterns of genotypes with probabilities of exhibiting phenotypic attributes, and determining the genomic ethnicity of the individual as a pattern of the probabilities of exhibiting the phenotypic attributes.

The inventive methods can be used to provide individualized health management products and services based on an individual's genetic profile as well as for pharmacogenomic studies in human or other populations or subpopulations of varying sizes and compositions (e.g., members of defined ethnicity, gender, age, or disease or other physiologically abnormal condition, etc.). They can also be applied to the analysis and reporting of genetic profiles of various human populations and subpopulations, as well as populations and subpopulations of other non-human species.

Thus, the present invention provides the benefits of economical and effective genetic profiling that involves (i) genotyping each individual who seeks such products and services for a broad set of genetic variants, (ii) scoring these variants for their association with the individual's phenotypic susceptibility to the later onset of various diseases, (iii) providing information on such genotype and phenotype information in a manner that can be understood by the individual and the individual's physicians and medical/health care counselors so that appropriate health management steps can be taken prior to onset, (iv) tracking on-going advances in research tailored specifically for individuals based on their genotypes and other personal information, and (v) providing information on these on-going advances to the individuals and their physicians and medical/health care counselors. The practice of the present invention is applicable to the provision of genetic profile-based health management products and services characterized by logistical efficiencies and economies of scale as well as accuracy of analysis, interpretation and reports, and the effectiveness of such tracking of related scientific research and medical advances, and which can be offered in accordance with applicable health care, privacy and related laws and regulations. Moreover, the practice of the present invention is a valuable tool for pharmacogenomics.

These and other advantages and features of the present invention will be described more fully in a detailed description of the preferred embodiments which follows.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the way lexica, thesauri, and ontologies are used to semantically classify and define genomics data. (Modified from Ashburner M et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat Genet 25: 25-29 [2000]).

FIG. 2 illustrates a database schematic of the invention.

FIG. 3 represents a BIOSQL schematic of the invention.

FIG. 4 represents a schematic overview of the SNP table of the invention.

FIG. 5 represents a schematic overview of the DiseaseGene module of the invention.

FIG. 6 represents a schematic overview of the OMIM-R module of the invention.

FIG. 7 represents a portion of a MeSH disease ontology (“C04”), which has been populated with human disease genes.

FIG. 8 shows a variation map for the MTRR gene. Nonsense coding mutations are shown in dark arrows (e.g., “Arg114Stp”), missense coding mutations are shown in grey arrows (e.g., “Ile22Met”) and silent mutations are shown with light grey arrows (e.g., “Leu179Leu”). Intronic mutations are shown with light grey arrows and upstream and downstream mutations within ±5 kb are shown with black arrows at the top of FIG. 8 (original code from Stein L et al., WormBase: network access to the genome and biology of Caenorhabditis elegans, Nucleic Acids Res 29: 82-6 [2001]).

FIG. 9 (A-D) shows an exemplary list of 2236 genes that can be included, in any combination of subsets, in a preselected set of markers in accordance with the invention.

FIG. 10 shows an example report.

FIG. 11 shows an example detailed report.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention relates to a method for determining whether an individual has an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes. In accordance with the method, genomic markers from an individual are evaluated for their zygosity at each member of a preselected set of markers.

For purposes of the present invention, an “individual” can be of any species of interest, including a human or a non-human species.

For purposes of the present invention, an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes, or a “relative probability,” is with respect to the general population in a particular geographical area or areas, or with respect to a defined subpopulation thereof, for example, but not limited to, a particular gender, age grouping, or ethnicity, or some other identifying feature.

“Zygosity” of a marker is its genotype with respect to any combination of heterozygosity or homozygosity for one or more possible alleles at a locus, including, but not limited to, a condition of heterozygosity or homozygosity for a particular allele or alleles between chromosomes, or allele heterozygosity among several possible alleles at a locus on one chromosome.

In accordance with the present invention, the preselection of the set of markers is based on genotype/phenotype associations with disease conditions or predispositions for disease conditions. The association of genotypes with phenotypes and further associating such phenotypes with diseases and predispositions includes utilization of existing data in the relevant literature and can also include data acquired from additional studies incorporating the practice of the present invention. FIG. 9 shows an exemplary set of 2236 genes that can be included (in its entirety or in any combination of subsets) in a preselected set of markers in accordance with the invention. The skilled practitioner is aware of other markers that can be included in the preselected set of markers.

In one embodiment, preselecting the set of markers can be achieved using haplotype mapping to capture the majority of historic recombination events that have occurred since the most recent common ancestor of the sample group or population analyzed. An example of such a haplotype map is by the National Institute for Genome Research, which comprises a set of a few hundred thousand polymorphic markers covering the entire human genome with sufficient density along each chromosome to measure the diversity of common haplotypes prevailing in each local region. Nevertheless, haplotypes also can be defined as a block or region of DNA that has been inherited as a unit. SNPs within that block of DNA will also have been inherited together. These SNPs can then be used to identify the presence of a particular block. This block can have any size from a few base pairs to large hundreds of kilobases of nucleotides. (Stephens J C, Mol Diagn. 4(4):309-17 [1999]). In one embodiment of the present invention these markers are used to find common alleles or variants associated with complex diseases and are used to interpret local variation patterns and provide information on the evolutionary origin of known polymorphisms that are the cause of functional differences between alleles or variants. The markers are prioritized in order to construct or “preselect” a reduced marker set to avoid the data management and statistical problems that attend whole genome scans using hundreds of thousands of genetic markers simultaneously. In a preferred embodiment of the present invention, certain genotypes are associated with certain phenotypes.

In another embodiment, the predicted and/or measured effect of different genetic markers on the DNA/RNA sequence changes that occur in the splicing process at exon/intron junctions is used to classify and preselect the set of genetic markers. Such classification can be accomplished by scoring each genetic marker candidate through the application of one or more predictive models applied to the data in a multivariate scoring matrix for that candidate. In this embodiment, all splice sites relevant to a variant are first identified by mRNA identification and alignments to the genomic DNA either through existing annotations or new annotations, thereby enabling both informatic and chemical retrieval of 5′ and 3′ splice site sequence regions. Second, variants with SNPs within these splice site regions and the associated sites are identified by the integrating and mapping of public and private mutations within these splicing associated sites. Third, the wild-type or original sequences and the mutated sequences are scored with one or more predictive models selected from the universe of: statistical models, such as weight matrix, Bayesian, hidden Markov models, semi- or general hidden Markov models; artificial intelligence models; and discriminative and predictive models related to binding affinity. This process results in splicing strength scores for the original variant and the variant with the SNP. The difference of these scores or any other metric applied to these scores can then be used to preselect markers for genotyping analysis and to score, rank, and prioritize factors in the construction of a multivariate scoring matrix.
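
By way of illustration only, the third step can be sketched with the simplest of the listed model types, a position weight matrix. The weights, the four-base window, the cutoff, and the function names below are hypothetical and chosen for readability; an actual weight matrix would be estimated from aligned, annotated splice site sequences.

```python
# Hypothetical log-odds weights for a four-base window at a splice site.
# These values are illustrative assumptions, not taken from the disclosure.
WEIGHT_MATRIX = [
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": 0.5, "C": -0.5, "G": 0.5, "T": -0.5},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
]


def splice_score(sequence):
    """Sum the per-position weights to get a splicing strength score."""
    if len(sequence) != len(WEIGHT_MATRIX):
        raise ValueError("sequence length must match the weight matrix")
    return sum(WEIGHT_MATRIX[i][base] for i, base in enumerate(sequence))


def score_difference(wild_type, mutated):
    """Score of the SNP-bearing sequence minus the wild-type score."""
    return splice_score(mutated) - splice_score(wild_type)


def preselect(candidates, cutoff=1.0):
    """Keep candidates whose score changes by at least the (assumed) cutoff."""
    return [
        name
        for name, (wild_type, mutated) in candidates.items()
        if abs(score_difference(wild_type, mutated)) >= cutoff
    ]
```

In this sketch a candidate with wild-type window GTAA and mutated window GCAA scores 3.5 and 1.5 respectively, so the difference of -2.0 exceeds the assumed cutoff and the marker would be retained.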

In one embodiment of the present invention, the preselected set of genetic markers comprises a plurality of exon/intron junction sequences. In another embodiment of the present invention at least about 20% of the genetic markers in the preselected set of genetic markers are exon/intron junction sequences. In another embodiment of the present invention at least about 40% of the genetic markers in the preselected set of genetic markers are exon/intron junction sequences. In another embodiment of the present invention at least about 60% of the genetic markers in the preselected set of genetic markers are exon/intron junction sequences. In another embodiment of the present invention at least about 80% of the genetic markers in the preselected set of genetic markers are exon/intron junction sequences.

In addition to applying predictive models to such splice sites for purposes of preselecting genetic markers and for constructing genotyping assays, such predictive models can also be applied for such purposes to large-scale promoter regions of genomic DNA that can contain one or more SNPs, functional binding sites within the 5′ or 3′ untranslated regions (“UTRs”), RNAi-genes, and miRNAs. In a preferred embodiment of the present invention, a method is provided of subjecting genetic markers within promoter regions to the same analytic selection process as specified above for splice sites. In one preferred embodiment of the present invention, the preselected set of genetic markers comprises a plurality of promoter sequences. In another preferred embodiment of the present invention, at least about 20% of the markers in the preselected set are promoter sequences. In another preferred embodiment of the present invention, at least about 40% of the markers in the preselected set are promoter sequences. In another preferred embodiment of the present invention, at least about 60% of the markers in the preselected set are promoter sequences. In another preferred embodiment of the present invention, at least about 80% of the markers in the preselected set are promoter sequences.

In another embodiment of the present invention, genetic markers of interest are identified and selected for inclusion in a preselected set, based upon a matrix with dimensions comprising genotypic and phenotypic criteria. These criteria can include, among others: base pair sequence homology to another known genetic marker sequence of interest; the presence of two or more regions of DNA on the same chromosome or genetic marker (synteny); relevance to the description of the molecular function, biological process and cellular component of the protein coded by the gene under investigation (ontology) and ontological classification; conservation of mutated sequence sites at conserved or less conserved sequence homology sites in the genome; quality of research on the genotype, genetic marker and phenotype under investigation; biological significance of the genetic marker (for example, whether the marker specifies a protein coding change); and regulatory value and classifications of the amino acid(s) specified by the genetic marker.

“Lexica”, “thesauri”, and “ontologies” are used to semantically classify and define genomics data. A “lexicon” is a list of terms belonging to the same semantic class: BMP4 and DPP, for example, both belong to the semantic class of “BMP” (i.e., bone morphogenetic factor). A “thesaurus” provides a listing of the synonyms for a term, or semantic class, and hierarchical “ontologies” are used to “define” the terms contained in a lexicon and a thesaurus. Long a cornerstone of computer science, ontologies have recently become a major focus of research in bioinformatics. Like controlled vocabularies, ontologies also enable data sharing but, because they contain hierarchical relationships between their terms, ontologies enable logical inference and deduction (see FIG. 1) on the data they contain, making them powerful tools for hypothesis generation. The definition of a term is produced by tracing the path from a term to the root of the ontology (FIG. 1, panel c, path starting from “BMP”). The simple ontology shown in panel c of FIG. 1, for example, defines “BMP” as “a TGF-β growth factor”. Definitions apply to all members of a semantic class and their synonyms, and can be used as a basis for logical inference: e.g., “DPP is a DVR, a DVR is a BMP, a BMP is a TGF-β, and a TGF-β is a growth factor”; therefore it can be inferred that “DPP is a growth factor” even if no document explicitly states this fact. Note that the ontology shown in panel c of FIG. 1 is a particular type known as an ‘isa-hierarchy’; other types of ontologies exist, not all of which are suitable for definition.
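
The inference described above can be sketched as a walk up an isa-hierarchy. The dictionary below encodes only the four relationships quoted in the text; the data structure and function names are illustrative assumptions rather than the schema of FIG. 1.

```python
# Each term maps to its parent in the isa-hierarchy, as quoted in the text:
# DPP is a DVR, a DVR is a BMP, a BMP is a TGF-β, a TGF-β is a growth factor.
IS_A = {
    "DPP": "DVR",
    "DVR": "BMP",
    "BMP": "TGF-β",
    "TGF-β": "growth factor",
}


def path_to_root(term):
    """Trace the path from a term to the root; the path serves as its definition."""
    path = [term]
    while path[-1] in IS_A:
        path.append(IS_A[path[-1]])
    return path


def is_a(term, ancestor):
    """True if 'ancestor' lies on the path from 'term' to the root."""
    return ancestor in path_to_root(term)[1:]


print(" -> ".join(path_to_root("DPP")))
print(is_a("DPP", "growth factor"))
```

Here `is_a("DPP", "growth factor")` evaluates to true even though that relationship is never stored directly, which is the kind of deduction the hierarchy makes possible.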

In one embodiment the selection of markers is based on microsatellite markers. Microsatellites include simple sequence repeats, short tandem repeats and simple sequence length polymorphisms that are characterized as relatively short tandem repeat nucleotide sequences, e.g. (TG)n or (AAT)n, generally less than 5 base pairs in length. Microsatellite markers have adequate physical densities for family-based (linkage) studies where the number of recombination events between closely related family members is small, hence markers even as distant as megabases apart are likely to be co-inherited. In addition, microsatellite markers are generally useful in genetic studies because of their hypervariability, co-dominance and reproducibility.

The present invention is also directed to a method of selecting a set of genetic markers which can be employed, for example, in preselecting the set of genetic markers in accordance with the method for determining whether an individual has an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes. Selection of a set of genetic markers according to the present invention results in a large set of genetic markers that map to splice site regulation loci and promoter loci for gene clusters and gene families associated with specific phenotypes as well as complete genome-wide splice site and promoter site genetic marker profiles.

Selecting the set of genetic markers is optimized by a meaningful marker prioritization, which requires the incorporation of additional facets of information to characterize the local genome context. One useful approach is to employ existing lists of candidate genes thought to be involved in a given disease. By computational mining of the disease literature, identifying sets of genes already implicated in related clinical phenotypes (or homologous genes involved in related phenotypes in model organisms), and by locating disease loci and disease-causing mutations within the common frame of reference of the genome sequence one can hope to extend these lists. Although extended, such lists still drastically limit the number of loci one needs to scan for disease association, while they are, hopefully, still inclusive enough to reduce the risk of omission of true causative loci. Within the targeted regions, typically a gene locus, one has the choice of either including all known polymorphic markers, or trying to make further reductions.

Some other useful considerations include marker location relative to the functional units of the gene (coding, UTR, splice site, regulatory, intron, etc.), marker zygosity (related to population frequency), or ease of assay development (local repeat structure, sequence composition for oligo design). Typically, focusing on extended candidate lists reduces the DNA search space from the entire genome to about one thousand to a few thousand loci, or roughly 5% of the genome. Additional pruning can introduce an additional 5-fold reduction. The result is a scenario where, at the cost of genotyping 1% of all the markers available for a genome-wide scan, one types all candidate loci, with the reasonable expectation that at least half of the causative loci are investigated. In addition to the reduction in genotyping cost, an advantageous reduction of false positive associations results, which false positive associations have been known to plague even very large clinical association studies and have been cited as an important cause of irreproducible reports of disease association.
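
The 1% figure follows from the two reductions stated above: roughly 5% of the genome retained by the candidate lists, divided by the further 5-fold pruning. A minimal check of that arithmetic, with a hypothetical function name:

```python
def genotyping_fraction(candidate_fraction, pruning_fold):
    """Fraction of genome-wide markers left after candidate listing and pruning."""
    return candidate_fraction / pruning_fold


# About 5% of the genome, pruned a further 5-fold, as stated in the text.
fraction = genotyping_fraction(0.05, 5)
print(f"{fraction:.0%} of genome-wide markers typed")  # prints "1% of genome-wide markers typed"
```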

In a preferred embodiment of the present invention, the preselected set of markers comprises genetic markers that map to at least about 1,000 discrete loci. In another preferred embodiment of the present invention, the preselected set of markers comprises genetic markers that map to at least about 2,000 discrete loci. In another preferred embodiment of the present invention, the preselected set of markers comprises genetic markers that map to at least about 3,000 to about 5,000 discrete loci. In another preferred embodiment of the present invention, the preselected set of markers comprises genetic markers that map to at least about 5,000 to about 10,000 discrete loci. In another preferred embodiment of the present invention, the preselected set of markers comprises genetic markers that map to at least about 10,000 to about 15,000 discrete loci. In another preferred embodiment of the present invention, the preselected set of markers comprises genetic markers that map to at least about 15,000 to about 30,000 discrete loci.

In accordance with the invention and the method for determining whether an individual has an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes, genomic markers corresponding to at least the preselected set of genetic markers from an individual are evaluated for zygosity. For purposes of the present invention, zygosity for a marker is detected in a biological sample collected from the individual that contains the individual's genomic DNA (such as, but not limited to, a blood, saliva, or tissue biopsy sample, which biological sample can be freshly collected or suitably stored to preserve the DNA) by employing suitable biochemical genotyping analytical assay means. Analytical hybridization or polynucleotide sequencing means are typically employed, optionally after amplification of DNA in the biological sample, for example, by using PCR-based amplification means. High throughput analyses can optionally be achieved by multiplexing techniques known in the art. The genotyping analytical assay means can optionally be performed with commonly available robotic apparatus and/or very dense array detection apparatus.

The determined zygosity of the markers for the individual is compared to the multivariate scoring matrix to obtain a marker score, wherein the multivariate scoring matrix correlates patterns of marker zygosity with probabilities of exhibiting phenotypic attributes. In accordance with the invention, the multivariate scoring matrix correlates patterns of marker zygosity with probabilities of exhibiting phenotypic attributes, based on scoring matrix vectors that can include descriptors of: family history, general medical physiological measures or values (such as, but not limited to, cholesterol levels, triglyceride levels, blood pressure, heart rate, HGH or other hormone levels, red blood cells, bone density, CD scan results, etc.), mRNA expression profiles, methylation profiles, protein expression profiles, enzyme activity, antibody load, and the like. The comparison with the multivariate scoring matrix can be done manually or, preferably, by employing a suitable computer software instantiation in which the multivariate scoring matrix is algorithmically constructed and manipulated via a programming language, for example, but not limited to, Java, Perl, or C++.

Based on this comparison it is determined whether the marker score indicates an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes, relative to a reference population, e.g., the general population of a chosen geographical area, or another chosen subpopulation thereof in terms of ethnicity, gender, age, or other identifying feature of interest.
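
One way such a comparison could be instantiated in software is sketched below. The additive scoring, the marker names, the weights, and the symmetric threshold are all assumptions made for illustration; the disclosure does not fix a particular scoring function or cutoff, and the weights here stand in for values that would be derived relative to a chosen reference population.

```python
# Hypothetical scoring matrix: (marker, zygosity) -> contribution to the score
# for one phenotypic attribute. Positive values raise the score.
SCORING_MATRIX = {
    ("marker_1", "hom_ref"): 0.0,
    ("marker_1", "het"): 0.3,
    ("marker_1", "hom_alt"): 0.8,
    ("marker_2", "hom_ref"): 0.0,
    ("marker_2", "het"): -0.2,
    ("marker_2", "hom_alt"): -0.6,
}


def marker_score(genotypes):
    """Sum the matrix entries for the individual's observed zygosities."""
    return sum(
        SCORING_MATRIX.get((marker, zygosity), 0.0)
        for marker, zygosity in genotypes.items()
    )


def classify(score, threshold=0.5):
    """Map a marker score to a relative probability category (assumed cutoff)."""
    if score >= threshold:
        return "enhanced"
    if score <= -threshold:
        return "diminished"
    return "average"
```

For example, `classify(marker_score({"marker_1": "hom_alt", "marker_2": "hom_ref"}))` yields "enhanced" under these assumed weights.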

The present invention further provides a means for providing relevant genetic information to an individual whose genetic profile has been determined in accordance with the present invention.

The method for providing relevant genetic information to an individual, and for generating information reports to be communicated, includes identifying the individual's genotypic characteristics by correlating those genotypic characteristics with a relative probability of exhibiting one or more phenotypic characteristics and determining for each of the one or more phenotypic characteristics whether the individual has an enhanced, diminished, or average probability of exhibiting the phenotypic characteristic, in accordance with the present invention as described hereinabove.

The method for providing relevant genetic information to an individual then includes applying one or more selection criteria for each such phenotypic characteristic, wherein each selection criterion imposes total, partial, or no limitation on the information communicated to the individual, identifying information that is relevant to the individual's probabilities of exhibiting such phenotypic characteristics and consistent with the limitations imposed by the selection criteria, and communicating the information to the individual in a report as described above.
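
The application of selection criteria can be sketched as a filter over the determined probabilities. In this illustration each criterion returns a total, partial, or no limitation, and the most restrictive result governs; that combination rule, the two example criteria, and all field names are assumptions made for the sketch and are not specified by the disclosure.

```python
from enum import IntEnum


class Limitation(IntEnum):
    NONE = 0     # communicate everything
    PARTIAL = 1  # communicate the category but withhold the detail
    TOTAL = 2    # communicate nothing


def withheld_by_individual(finding, preferences):
    """Criterion specified in advance by the individual (hypothetical)."""
    if finding["phenotype"] in preferences["withheld"]:
        return Limitation.TOTAL
    return Limitation.NONE


def treatment_availability(finding, preferences):
    """Criterion based on whether an effective treatment exists (hypothetical)."""
    return Limitation.NONE if finding["treatable"] else Limitation.PARTIAL


def build_report(findings, preferences, criteria):
    """Apply every criterion to every finding; the most restrictive result wins."""
    report = []
    for finding in findings:
        limit = max(
            (criterion(finding, preferences) for criterion in criteria),
            default=Limitation.NONE,
        )
        if limit == Limitation.TOTAL:
            continue
        entry = {
            "phenotype": finding["phenotype"],
            "probability": finding["probability"],
        }
        if limit == Limitation.NONE:
            entry["detail"] = finding["detail"]
        report.append(entry)
    return report
```

Because each criterion is an ordinary function, criteria can be added, removed, or reapplied without changing the report-building step.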

In one embodiment of the present invention, the same or different selection criteria are applied one or more additional times to the determined probabilities of exhibiting each of the phenotypic characteristics, information is identified that is relevant to the individual's probabilities of exhibiting the one or more phenotypic characteristics and consistent with the limitations imposed by the selection criteria, and the information is communicated to the individual.

In one embodiment of the present invention, at least one of the selection criteria is specified in advance by the individual.

In another embodiment of the present invention, at least one of the selection criteria is a function of the availability of treatments effective to modify the phenotypic characteristic.

In other embodiments, at least one of the selection criteria is a function of the scope and quality of known research relating to the phenotypic characteristic; or at least one of the selection criteria is a function of the probability determination(s) for one or more other phenotypic characteristics.

The information is then communicated to the individual, whether directly, or indirectly via an appropriate intermediary, in a report that generally includes an explanation of relevant terminology; a summary of the genetic profile of such individual subject; an explanation of each genotype assay performed on biological samples from the individual, typically as categorized by types of disease (such as, but not limited to, cancer, cardiovascular disease, and the like), and in relation, where applicable, to specific body organs, tissues and metabolic, reproductive or other bodily functions and systems; a summary and detailed results for each such genotype assay performed; a health risk appraisal for such individual subject; general information about genetics and genomics; and references for further information. Such report can include summary sections of genes and gene families important to individuals based on their specific phenotypic impact and the demographic, ethnic, gender, age, and related characterizations of each individual.

For example, the categories for such characterization can include, but are not limited to, aging, women's health, and drug interactions. These summaries can include (i) an assessment of the overall genetic health of the individual in each health category presented and the overall genetic health of the individual and (ii) cross references to the more detailed information within the report.

The information report can be communicated orally, but is preferably communicated in a documentary format, whether written or electronic, and in any suitable order of report presentation. Prior to communicating the information to the individual, the information is preferably, but not necessarily, formatted to present the relevant phenotypic attributes according to an organizational matrix, wherein the organizational matrix determines the grouping and presentation of information to the individual. More preferably the organizational matrix groups the various phenotypic characteristics for which the individual has an enhanced probability together. In some embodiments the organizational matrix groups phenotypic characteristics related to similar physiological systems together. In other embodiments, the organizational matrix ranks the phenotypic characteristics as a function of the potential impact on the individual's lifestyle or quality of life, or the organizational matrix ranks the phenotypic characteristics as a function of the “genomic ethnicity” of the individual, as described herein.

In a preferred embodiment of the present invention, the information report relates to a broad selection of diseases. Such diseases may include, among others, cancer and those relating to the following organs, tissues, and metabolic, reproductive and other bodily functions and systems involved in human health, including, but not limited to, cardiovascular, respiratory, kidney and urinary tract, immune system, gastrointestinal, neurological, psychoneurological, and hematological functions and systems.

In a preferred embodiment of the present invention, the report for each such disease, disorder and other adverse medical condition comprises information about the relevant genes, the sites assayed in such gene, the clinical association of variants at such sites with relevant disease, the genotype of the individual subject at each such site, information about the association of such genotype with the phenotype associated with such disease, and information about drug interactions for the individual subject based on such genotype.

Prior to communicating the information to the individual, the identity of the individual need not be associated with the data corresponding to the genotypic characteristics, the relative probabilities of exhibiting the phenotypic characteristics, or the identified relevant information. A suitable coding or “blind” system can be employed to shield the identity of the individual, if appropriate.
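
A coding system of this kind could, for instance, replace the identity with a random subject code and keep the code-to-identity mapping in a separate registry. The sketch below is one hypothetical arrangement; the record layout and function names are assumptions, and access control for the registry is outside its scope.

```python
import secrets


def assign_code(identity, registry):
    """Issue a random code for an identity and record it in the registry."""
    code = secrets.token_hex(8)
    while code in registry:  # regenerate in the unlikely event of a collision
        code = secrets.token_hex(8)
    registry[code] = identity
    return code


def blind_record(identity, genotype_data, registry):
    """Return a record that carries the subject code in place of the identity."""
    return {
        "subject_code": assign_code(identity, registry),
        "genotypes": dict(genotype_data),
    }
```

Only a party holding the registry can reassociate a blinded record with the individual before the report is communicated.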

The present invention provides that the format and substance of the information report can be modified from time to time on an ongoing basis, particularly in response to questions from individuals, physicians and medical counselors about genetic markers, genotypes, genotype assays, the association of genotype with phenotype and diseases, and other aspects of such reports. In a preferred embodiment of the present invention, such questions are stored in a database and the overall effectiveness of communication of the text and graphics presented in such reports is from time to time assessed. In a preferred embodiment of the present invention, the report generator is constructed so that changes to the information report format and substance can be effected efficiently and economically.

The present invention also relates to a method of evaluating the genetic profile combination (i.e., “compatibility”) of two individuals of the opposite sex of the same species, in other words, a method of evaluating the probability that progeny of two individuals of the opposite sex will exhibit one or more phenotypic attributes. The method is applicable to providing genetic counseling to pairs of individuals of the opposite sex relating to preselected risks engendered by their respective genotypes, or genetic profiles, to their progeny, whether prospectively or after the birth of the progeny. Alternatively, the method is applicable to providing genetic counseling to a recipient parent (or recipient parents) of progeny resulting from, or intended to result from, the use of donor gametes, e.g., donor sperm and/or ova.

The method involves evaluating genomic markers from each of the two individuals whose gametes are to contribute, or contributed, to the genetic inheritance of the progeny by the formation of a zygote, i.e., the genetic or so-called “biological” parents. Zygosity at each member of a preselected set of markers is evaluated, and a determination is made of a probability distribution for the zygosity for each member of the preselected set of markers in the genomes of the progeny of the two individuals. The probability distributions are compared to a multivariate matrix to obtain a probability distribution score. The multivariate matrix correlates patterns of marker zygosity with probabilities of exhibiting phenotypic attributes, as described hereinabove. Then it is determined whether the probability distribution score indicates that the progeny of the two individuals would have an enhanced, diminished, or average probability of exhibiting one or more phenotypic attributes.
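
For a single biallelic marker, the probability distribution for progeny zygosity can be sketched under simple Mendelian segregation, in which each parent transmits either of its two alleles with equal probability. That inheritance model, the allele labels, and the weights used for the score are assumptions for illustration; the disclosure does not prescribe how the distribution or the score is computed.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical matrix entries for one phenotypic attribute, by progeny genotype.
PHENOTYPE_WEIGHTS = {
    "AA": Fraction(0),
    "Aa": Fraction(3, 10),
    "aa": Fraction(8, 10),
}


def progeny_distribution(parent_1, parent_2):
    """Genotype probabilities for progeny of two parents at one marker.

    Each parent is a two-character genotype such as "Aa"; every pairing of
    one allele from each parent is treated as equally likely.
    """
    counts = Counter(
        "".join(sorted(allele_1 + allele_2))
        for allele_1, allele_2 in product(parent_1, parent_2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def distribution_score(distribution, weights):
    """Expected matrix value over the progeny genotype distribution."""
    return sum(prob * weights[genotype] for genotype, prob in distribution.items())
```

With two heterozygous parents, `progeny_distribution("Aa", "Aa")` gives AA 1/4, Aa 1/2, and aa 1/4, and the score under the assumed weights is 7/20. Extending the sketch to a full marker set would require a rule for combining per-marker scores, which is likewise left open here.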

The present invention further provides a method for determining the genomic ethnicity of an individual for the purpose of determining the likely applicability of clinical research results based upon the genetic profile of the individual. Clinical research results are often only found in particular populations or subpopulations, and these results are frequently correlated with the ethnicity of the population or subpopulation. “Genomic ethnicity” means a genetic profile of an individual having a distribution of genetic markers in a preselected set of markers, based on genotype/phenotype associations with disease conditions, that is preponderantly consistent with the distribution of those markers that is determined to be or is known in the art to be characteristic of a particular ethnic population or subpopulation. The determination of genomic ethnicity in accordance with the method can often provide more useful information than an individual's mere self-reporting of ethnicity.

The method involves evaluating genomic markers from an individual at each member of a preselected set of markers, comparing the genotype for each of the markers to a multivariate matrix, wherein the multivariate matrix correlates patterns of genotypes with probabilities of exhibiting particular phenotypic attributes, and determining the genomic ethnicity of the individual as a pattern of the probabilities of exhibiting the phenotypic attributes.

It should be borne in mind that the genotype-phenotype relation requires two axes, the phenotype (e.g., a particular disease) and the mutations. Most known disease-causing mutations are found in single-gene disorders because these have been the easiest to find. Most of the OMIM disease entries are in this class; hence most OMIM mutations cause single-gene disorders. These disorders are usually very rare in the general population (although they can be much more prevalent in a given family or population group). For diseases in this group, allele state often predicts disease with near certainty. But there are more complex scenarios: (i) with the current state of knowledge, genotype-phenotype association can only be cast in statistical terms (i.e., having the “bad” allele does not necessarily mean the individual will ever develop the disease, but means a heightened chance or predisposition, relative to the general population or a defined subpopulation, of developing it in the future), or (ii) the association is only seen in some studies but not in others, or (iii) the association is only seen in a single family, in a population isolate, or in one of the world populations, with no knowledge of relevance for other populations. Also typical in these cases is that the associated marker is almost certainly not itself causative, merely an associated marker for the disease. In accordance with the methods of the present invention, each of these scenarios is taken into account as appropriate in constructing the multivariate matrix, or multivariate scoring matrix, and as the available information improves concerning genotype/haplotype relationships with a given phenotype, the multivariate matrix, or multivariate scoring matrix, can be updated and expanded, thereby enhancing its predictive value when employed within the inventive methods. However, the practice of the inventive methods is not limited by the precise contents of the multivariate matrix at any given time, which will vary as new information is incorporated.

The information about the genetic profiles of individuals for whom genetic profiles have been generated of the type produced in accordance with the present invention can facilitate delivery of appropriate therapies to the individuals, can enable the further identification of novel genotype-phenotype relationships to support the discovery of new therapies, and can provide novel indicators of health. Thus the present invention can be applied to pharmacogenomic analysis and the development of more effectively directed drug therapies.

In another application, the information generated in accordance with the present invention can be maintained in an updatable database. The database can be constructed using widely available database management tools, such as Oracle, or open source database tools such as PostgreSQL. The information in the database can be used to select individuals for whom investigational therapies or newly commercially available therapies are likely to be most effective, based on their genetic profiles. For example, such selection can be made positively on an indication that individuals with a disease or predisposition to disease can benefit from particular therapies, or negatively on an indication that such individuals are at risk with a particular therapy. Such classification schemes can be used to identify subsets of populations with complex disease for purposes of course-of-therapy decisions.

EXAMPLES

Example 1 Use of a Genetic Marker as an Index to Literature

A genetic marker, and a variant found at that marker, can be characterized by a number of factors. These can include a unique sequence surrounding it, a fixed offset from a known reference sequence point, or a specific amino acid change in a particular protein. The clinical literature refers to markers, mutations, or polymorphisms in a variety of different ways. In accordance with the present invention, individual markers and variants can be characterized using these and other attributes in order to use them to index clinical literature. This index can be constructed of a single marker, or of a set of markers. By indexing literature in this way, a consolidated report can be constructed based upon these markers. If, for example, there exists a genotyped or sequenced database of markers that have been found to be present in an individual, this marker database can be used to create an extract of clinical literature that is particularly relevant to the individual. This use of an individual's markers or variants as a literature index is a useful component in providing personalized communication of information from literature and other databases, based upon an individual's genetic profile.
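The indexing idea described in this example can be sketched, purely for illustration, as an inverted index from marker identifiers to literature records. The identifiers used here (`marker-1`, `article-A`, and so on) and the function names are hypothetical. A practical index would first have to normalize the many ways in which the literature names the same marker.

```python
from collections import defaultdict


def build_marker_index(articles):
    """Build {marker_id: set of article_ids}.

    `articles` is an iterable of (article_id, marker_ids) pairs, i.e. each
    literature record with the markers it has been annotated with.
    """
    index = defaultdict(set)
    for article_id, marker_ids in articles:
        for marker_id in marker_ids:
            index[marker_id].add(article_id)
    return index


def literature_extract(index, individual_markers):
    """Return the sorted article ids relevant to an individual's markers."""
    hits = set()
    for marker_id in individual_markers:
        hits |= index.get(marker_id, set())
    return sorted(hits)


# Hypothetical example data.
index = build_marker_index([
    ("article-A", ["marker-1", "marker-2"]),
    ("article-B", ["marker-2"]),
    ("article-C", ["marker-3"]),
])
extract = literature_extract(index, ["marker-2", "marker-9"])
```

Here `extract` contains only the two records that mention `marker-2`; a marker with no indexed literature simply contributes nothing.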

Example 2 Association Representation

As genetic disease associations become more complex, interpreting the implications of these associations becomes more complicated. The overall quality of the association, in a number of dimensions, becomes increasingly important. In addition, in order to understand the importance of a particular association to an individual, an association, in all of its texture, needs to be viewed in the overall context of the individual's genetic profile. Characterizing and representing this quality, both independent of the individual and in the context of the individual's genetic profile, becomes fundamental to the appropriate usage of the information.

The factors that shape the implications of a genetic association are diverse. They range from the causal linkage seen with the association (e.g., the percentage of disease cases that are associated with a particular association and the percentage of individuals with a given marker who develop the disease), to factors related to the breadth of the research (e.g., number of studies, number of individuals studied, and number of ethnicities studied). By characterizing and representing the associations in this manner, individuals, their counselors, and their health care providers can better understand, appropriately value, and make use of them.

One method of representing the quality of an association, for the purposes of analyzing, evaluating, determining, identifying, comparing, or formatting the information and/or of communicating or presenting the information to the individual, is to place associations on a two-dimensional grid. The axes of this grid would be the percentage of disease cases that are associated with a particular association and the percentage of individuals with a given marker who develop the disease. Each association would be represented, e.g., by a colored circle, on this grid; the color can represent the number of studies of this association, with, for example, red representing a low number, yellow representing an intermediate number and green representing a large number of studies. The darkness of the circle can represent the number of ethnicities studied, and the size of the circle can represent the number of individuals. In this way, at a glance, a viewer can understand how a particular association compares to others, and the overall quality of the association itself. For example, a light, red circle close to the origin is a very weak association. A dark, green circle in the top right of the graph is very significant and well-studied in large populations.
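The mapping from the attributes of an association to the visual properties of its circle can be sketched as follows. This is illustrative only: the thresholds separating a "low", "intermediate" and "large" number of studies, the scaling constants, and all names in the code are assumptions, since the text does not fix them.

```python
from dataclasses import dataclass


@dataclass
class Association:
    name: str
    pct_cases_with_association: float   # x axis, 0-100
    pct_carriers_with_disease: float    # y axis, 0-100
    n_studies: int
    n_ethnicities: int
    n_individuals: int


def study_color(n_studies, low=3, high=10):
    """Color by number of studies; `low` and `high` are assumed cut-offs."""
    if n_studies < low:
        return "red"
    if n_studies < high:
        return "yellow"
    return "green"


def glyph(assoc, max_ethnicities=10, max_individuals=100_000):
    """Return the plotting attributes for one association circle."""
    return {
        "x": assoc.pct_cases_with_association,
        "y": assoc.pct_carriers_with_disease,
        "color": study_color(assoc.n_studies),
        # Darkness and size are scaled to the range 0..1 (assumed caps).
        "darkness": min(assoc.n_ethnicities / max_ethnicities, 1.0),
        "size": min(assoc.n_individuals / max_individuals, 1.0),
    }
```

The resulting dictionaries can be handed to any plotting tool; restricting the input to the associations found in one individual's profile gives the individual-specific graph.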

The same graph can be constructed of associations for those markers that have been found in an individual's genetic profile. With this, an individual or his or her counselor or health care provider is immediately focused on those associations that are most significant.

Of course, this multi-dimensional representation can take many graphical forms. An alternative method might use a radar graph. A radar graph is a two-dimensional polar graph that enables one to simultaneously display many variables. It does this by plotting each variable along a different radial axis emanating from the origin of the polar plot. If one has ten variables, then there will be ten radial axes thirty-six degrees apart. Small values are near the center of the polar plot and large values near the outer circumference. For example, an experiment might result in multiple measured variables, and the investigator may want to compare the results of this experiment repeated under different conditions. If one connects the variable values on each radial axis with a straight line, then a distorted star-like pattern (or “polygon”) is created. One can then visually compare the patterns created by the lines representing the different conditions for the experiment. In addition, several polygons can be plotted on the same set of radial arms through the use of one color or line pattern per polygon.
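The geometry of such a radar graph can be sketched in a few lines. This is a minimal illustration with hypothetical function names: the axes are spaced evenly around the origin, and each value becomes one vertex of the polygon.

```python
import math


def radar_axis_angles(n_variables):
    """Angles in degrees of the radial axes, evenly spaced over 360."""
    return [i * 360.0 / n_variables for i in range(n_variables)]


def radar_polygon(values):
    """Cartesian vertices of the polygon for one set of variable values.

    Variable i is plotted at distance values[i] from the origin along
    radial axis i; joining the vertices in order gives the polygon.
    """
    n = len(values)
    vertices = []
    for i, value in enumerate(values):
        theta = 2 * math.pi * i / n
        vertices.append((value * math.cos(theta), value * math.sin(theta)))
    return vertices
```

With ten variables, `radar_axis_angles(10)` places neighboring axes thirty-six degrees apart, as in the text. Calling `radar_polygon` once per condition gives the polygons to overlay.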

The preceding are illustrative, and not exhaustive, examples of how graphical representation of a number of factors that characterize a genetic association can support an enhanced understanding of the association, both individually and in the context of other associations, as well as independently and in the context of an individual's genetic profile.

Example 3 Characterizing Patterns of Associations Found in an Individual's Genetic Profile

Genetic profiles, defined as a set of markers found in an individual's genome ranging from a small set of markers to the entire genetic sequence, are used to generate or predict attributes of individuals that can be used for a variety of applications. These applications can range from identity verification to determining disease predisposition and drug interactions to feature prediction (e.g., height, hair color or eye color). In addition, the attributes can range from being absolutely correlated to the genetic profile, to being shaped by the genetic profile, to being unpredictable from the genetic profile. Understanding and characterizing the degree to which a genetic profile specifies an attribute has great utility. For example, the generation of a profile from DNA of a suspected criminal is much more useful when the degree to which the feature can be predicted is well-understood. If red hair is almost assured, but height much less specified by the profile, searches can be better refined. Capturing and representing this information becomes much more important as the number of attributes to be analyzed grows. The methods that can be used to extract and capture these patterns include statistical models, Bayesian networks, hidden Markov models, and other machine learning techniques.

Example 4 Constructing Multivariate Matrix

A goal of the present invention is to bridge the time gap that currently exists between the latest research advances and the final benefit to the consumers in the health care system. Through genetic testing and information delivery, this gap can be bridged and shortened substantially. For example, individuals can be tested for a newly identified functional polymorphism (e.g., the RANTES gene; Hizawa, N. et al., A functional polymorphism in the RANTES gene promoter is associated with the development of late-onset asthma, Am J Respir Crit Care Med 166: 686-90 [2002]; see Table 1 below), information about which can be added to the multivariate matrix, or multivariate scoring matrix, in accordance with the present invention. Suitable genotyping assays can be developed for the newly identified polymorphisms described in the original article in a matter of weeks. After the genetic testing with the genotyping assay, the result can be communicated to the tested individual, covering her or his genetic profile and the relevant clinical phenotype information, such as specific risk factors (in the RANTES gene example, its relationship with asthma) that are correlated with a specific genetic profile. The clinical information for the RANTES gene is not limited to this article; many more articles describe the relationship between genetic markers in the RANTES gene and a clinical phenotype. One useful database that captures genetic literature information concerning a disease more comprehensively is the “Online Mendelian Inheritance in Man” (OMIM) database (Hamosh, A et al., Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders, Nucleic Acids Res 30: 52-5 [2002]; McKusick V A, Online Mendelian Inheritance in Man, OMIM (TM). McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins [2000]). The OMIM database contains textual information and references on inherited diseases and genetic disorders. It also contains copious links to MEDLINE and sequence records in the Entrez system, and links to additional related resources at NCBI and elsewhere. This database is accessible on the internet (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM), as well as in a hard copy book. The electronic version is also distributed in XML format under an NIH license from the National Library of Medicine. While the OMIM database is very useful for clinical and academic research, it is still far too technical for an individual and her/his healthcare provider (see as an example the OMIM entry for the RANTES gene in Table 2 below).

A prototype system was developed that allows scientists to position alleles described in the Online Mendelian Inheritance in Man (OMIM) database on the build 33 assembly of the Human Genome from the National Center for Biotechnology Information (NCBI). Once placed on the assembly, the system automatically classifies the alleles within the context of a gene and its genomic structure; i.e., coding (synonymous/non-synonymous), UTR, etc. The OMIM database is currently the most comprehensive and relevant database for human genetic disease, and we have therefore focused our efforts on mapping the OMIM alleles to the genome. Toward this end we have developed a ‘deterministic’ algorithm for OMIM allele mapping. Most of the alleles described within OMIM are clinically relevant, and have links to the appropriate scientific publications; in an attempt to better manage the clinical data associated with these alleles, we have also developed a concise and high-level disease ontology.

All five subsystems of the prototype system have been tested with a set of known, well-characterized disease genes. Specific care and focus have been given to the user-friendliness and user interaction of the annotation station, to an extendable and flexible design for the database infrastructure, and to a practical approach for the OMIM allele mutation mapping algorithm that captures as many OMIM mutations within the genome as possible while avoiding false positives. During the genetic disease ontology development, we largely made use of existing medical textbook classifications, due to the widespread familiarity of these classifications within the medical community. The following sections detail the software and the science developed.

(a) Prototype Visualization Annotation Station

We built a prototype SNP- and mutation-oriented genome annotation station, in which we leveraged existing open source genome browser software as developed in the WormBase project (Stein, L. et al., WormBase: network access to the genome and biology of Caenorhabditis elegans, Nucleic Acids Res 29: 82-6 [2001]). For additional data manipulation work we have used PERL code from the open source BioPerl project (Stajich, J E et al., The bioperl toolkit: perl modules for the life sciences, Genome Res 12: 1611-8 [2002]). The annotation station is centered on genes: given a gene symbol, the genomic region for the specified gene can be viewed.

Our “GeneViewer” can display different types of genomic sequence variations within a graphical user interface. The display now shows graphically a genomic gene region with 5 kb sequence upstream and downstream of the start and stop codon of a gene. It also displays neighboring genes that are located within this +/−5 kb window. The gene structure is displayed using the “traditional” exon-intron viewer, in which exons are linked with arcs that span the introns. Currently the display color-codes and groups missense, nonsense, coding and non-coding mutations onto the gene structure. The variation symbols list the nucleotide and/or protein amino acid change in a small text glyph. A variation map screen shot of this GeneViewer can be seen in FIG. 8, which shows the MTRR gene.

Following the GeneViewer is a short “Gene Summary”, which tabulates all the general information about a gene, such as NCBI's LocusLink identifier(s), genomic Genbank records, RefSeq links, full textual gene names as well as internal disease classifications from our Ontology (see below) that are assigned by the annotators.

Following the Gene Summary is the “Variation Summary”, which seeks to list in table format all known genetic markers within this gene from different mutation databases. This table currently lists OMIM alleles and dbSNP entries including their sequence position within the genomic sequence as well as the protein sequence. Furthermore, codon position, codon changes, and a short functional characterization (missense, non-sense, exonic, intronic, 5′UTR, 3′UTR, upstream or downstream) are included. A snapshot for the MTRR gene can be seen in FIG. 8.

In the next two sections, the “mRNA Summary” and the “CDS Summary”, raw mRNA and coding sequences (CDS) are displayed, including the Genbank annotations. On demand, the genomic sequence can also be displayed.

The “Protein Summary” follows, which, besides the translated coding region, lists under the wild type sequence the protein mutations caused by the sequence variations listed in the Variation Summary table. This is very useful, because for each occurring nucleotide change the resulting protein mutation can be seen.

At the end of each gene page, the corresponding OMIM entry is listed from a local OMIM copy from our database to avoid the reloading of every OMIM entry from the NCBI site through the internet (“OMIM Summary”). This allows the annotators to review the verbal gene annotation within the same display.

The data required for effective and accurate annotation are diverse. To address this diversity, the visual annotation station is split into these different views. The prototype also allows other information sources to be added and even new summary views to be generated if necessary. Improvements can be added, including various hyperlinks within the interface that can link to external (NCBI: dbSNP, RefSeq, Genbank) as well as internal sources (OMIM, filtered version of PUBMED; see text mining algorithms below). For example, this can be addressed by downloading the OMIM database from the NIH and incorporating the OMIM locus entries within the genome annotation station. This allows mutations to be cross-linked within a gene document and gives fast access to the allele variation description in OMIM on the same page.

(b) Prototype Annotation Pipeline Software

We have built a selection of PERL scripts to download the latest human genome assembly as well as annotations from Genbank and RefSeq. The current prototype system has loaded the build 33 assembly, but is easily updated. Furthermore, dbSNP and the OMIM databases have been licensed and downloaded from the NCBI ftp site.

To capture and identify all known human genes predisposing to disease, we have begun with a widely accepted, recently published paper tabulating and summarizing all known human disease genes (Jimenez-Sanchez, G et al., Human disease genes, Nature 409: 853-5 [2001]). This set consisted of the non-somatic genetic associations contained in the OMIM database. This set of 923 human disease genes was expanded with a collection of genes derived from an in-house analysis of hundreds of current clinical research papers, selecting those associations that were most significant and “actionable” by the clients of our service. Any hint of an association between a novel disease gene and a clinical phenotype within an abstract triggered the collection of the paper from the library or online databases. The original papers were then reviewed and, if significant, the gene was manually added to our collection of disease genes. Additional papers that justify the selection of a certain gene into our disease gene set were noted as additional evidence. A broad disease classification scheme (see “high-level disease ontology section”) was developed during this process as well. The total number of genes in our system is currently 1,732. Given that the current estimate of the total number of human genes is between 25,000 and 30,000, this would mean that between 5.5% and 7.0% of human genes can be classified to be disease related. Of course, this is only a snapshot of the current clinical research and we expect the percentage to increase over time. 98% of these genes already had OMIM entries.

Of the 1,732 hand-selected genes, only 1,614 could be uniquely located in the latest genome assembly (see Table 3). The remaining 118 genes were either annotated at multiple loci within the human genome or were missing from both RefSeq and Genbank. The location of these 118 genes within the human genome sequence is next to be manually determined, as many belong to small paralogous families that have significant clinical implications.

During the loading process of the flat file OMIM entries, we have parsed out the OMIM allele identifiers. It has been very difficult to develop this PERL parser due to the wide mixture of formats used for allele annotations. The allele identifiers have then been used to locate the protein mutations within the human genome assembly and the Genbank annotations. As can be seen in Table 3, 57.29% of the OMIM alleles could be localized directly within the genome assembly through the Genbank annotations using only the protein variation allele identifiers. This procedure of comparing OMIM alleles with the Genbank annotations is a quality check for the Genbank annotations. The main reason that not all of the OMIM alleles could be easily mapped to the genome is that the mutations were first described against differing protein sequences. Case studies for some of these genes have shown that the underlying protein sequence has usually been extended on the 5′ end of the coding sequence over time. In previous publications only the 3′ part of a gene sequence was known (typically because of the early 3′ EST sequencing efforts), and old, incorrect amino acid variation annotations have been propagated through the literature until today. In an attempt to circumvent this problem we have developed a ‘deterministic’ algorithm for OMIM allele mutation mapping; see below.

The integration of dbSNP polymorphisms has been straightforward, because they are linked to a specific gene sequence within Genbank using the dbSNP identifiers (“rs” and “ss” numbers). Surprisingly, only a very small number, 1.87%, of OMIM alleles have a dbSNP entry (see Table 3). We speculate that there are two reasons for this. First, most of the alleles in OMIM are Mendelian mutations and have a very low minor-allele frequency. Secondly, SNPs have only recently been discovered during the human genome project and the SNP consortium efforts; clinical studies showing clinical correlations are still under way, and these SNPs therefore have no allele entry within OMIM.

Encouraged by the surprisingly low overlap between OMIM alleles and dbSNP entries, we performed a preliminary study using HGMD, which showed that OMIM alleles are represented in HGMD. We could find approximately 40% of the OMIM alleles from our 1,732 disease genes annotated in HGMD. This is a most promising finding and allows us to conclude that a lot of information can be gained by using the annotations from HGMD to link OMIM with the human genome assembly.

(c) Prototype Database Infrastructure

A prototype database infrastructure was built. Existing public domain database schemas were adapted to fulfill specific variation-related needs. Furthermore, software was developed to load and update the specific genome and clinical databases. All annotation pipeline software as well as the visualization annotation station are driven by this database system. The detail of the prototype database system is as follows.

The database uses a loosely-coupled modular architecture (FIG. 2). Within each module, tables are tightly-coupled by means of relational foreign keys. Between modules, tables are implicitly linked by means of dbxrefs (database accessions) and gene symbols.

This modular architecture facilitates better software engineering and database management. Modules can be swapped in and out easily. Loading large data bulks or data management can be carried out on any module independent of the others. For example, if NCBI releases a new human genome assembly build, or a new dbSNP build, we can bulkload this without perturbing our marker data.

We used the PostgreSQL version 7.3 relational database management system (RDBMS). PostgreSQL has the advantage of being open source, as well as extremely robust, with a large community of users and developers who are eager to provide support for free. It is much simpler to manage and administer than commercial RDBMSs such as Oracle, and has many advantages over the other open source RDBMS like MySQL. We believe MySQL is not suited to a production environment because of its lack of full SQL92 support, such as views, sub-selects and foreign key integrity.

We have taken care not to incorporate any PostgreSQL-specific functionality into the design, which means that a port to a commercial RDBMS such as Oracle should in the future be a relatively uncomplicated task.

(d) BioSQL Module

BioSQL (obda.open-bio.org) is a third party open source database schema and a PERL API (Application Programmer Interface). It is a generic database of sequences and sequence features. It is ideal for representing sequence data such as that from GenBank, the EMBL sequence database, SwissProt and RefSeq. We used it to store the NCBI Human Genome and NCBI human RefSeq data, loaded from NCBI flat file format using a PERL script that comes with the BioSQL distribution.

The BioSQL data model is based around the concept of ‘bioentries’ and ‘seq_features’. These are equivalent to a Genbank/RefSeq record and a feature table entry, respectively. One of these records typically contains features of type gene, mRNA, CDS and variation. The intersection between the locations of these features tells us where the markers such as SNPs are with respect to exons, introns, untranslated regions (UTRs) and up/downstream regions. This data model is also suitable for housing the locations of the HGMD and OMIM variations mapped onto genomic coordinates. The most relevant part of the BioSQL schema can be seen in FIG. 3.
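The use of feature locations to place a marker relative to a gene can be sketched as follows. This is a simplified illustration and not the BioSQL API: it assumes a single transcript on the forward strand, 1-based inclusive coordinates, exons sorted by position, and a 5 kb flanking window. The function name and all coordinates are hypothetical.

```python
def classify_marker(position, exons, cds_start, cds_end, flank=5000):
    """Classify a genomic position relative to one forward-strand gene.

    exons     -- sorted list of (start, end) pairs, 1-based inclusive
    cds_start -- first coding base (start codon)
    cds_end   -- last coding base (stop codon)
    flank     -- assumed size of the upstream/downstream window
    """
    gene_start, gene_end = exons[0][0], exons[-1][1]
    if position < gene_start:
        return "upstream" if gene_start - position <= flank else "intergenic"
    if position > gene_end:
        return "downstream" if position - gene_end <= flank else "intergenic"
    if not any(start <= position <= end for start, end in exons):
        return "intron"
    if position < cds_start:
        return "5'UTR"
    if position > cds_end:
        return "3'UTR"
    return "coding"


# Hypothetical two-exon gene.
exons = [(10000, 10200), (10300, 10400)]
labels = [classify_marker(p, exons, 10150, 10350)
          for p in (9000, 10120, 10180, 10250, 10380, 10450)]
```

For a reverse-strand gene the 5′ and 3′ labels and the upstream and downstream labels would be swapped; a fuller version would also handle multiple transcripts per gene.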

The snp table (FIG. 4) stores single base pair and other variation features. Any one marker can have multiple effects, at the genomic, transcriptional and protein levels; this is what the snp_function table is for. It allows us to easily query for markers that affect a particular genomic area (exon, intron, UTR, protein, intergenic) or, for protein-affecting markers, to query the amino acid modification. Each of these effects is with respect to a particular gene; we have a weak link (a non-foreign key that is not enforced) to the gene table (in the DiseaseGene module, see below) via the official gene symbol.

(e) DiseaseGene Module

The DiseaseGene module (FIG. 5) is primarily for storing literature-based curation of genes implicated in diseases, and associated information such as likely markers. Note that the gene_to_snp table is primarily for storing information on the literature curator's notes on possible markers, and is only weakly linked (implied via locational correspondence) with the snp table in the so-named OMICIASNP module. The gene table uses the official gene symbol; alternate symbols are stored in the gene synonym table. Each gene can have multiple disease categories, associated diseases or OMIM IDs.

(f) Phenotype Module

The Web-based graphical user interface (GUI) that supports the manual mutation curation and mapping of phenotype information to the current genome assembly is of a client/server design: a graphical representation of a gene sequence with the associated mutation data is presented through a Web browser and delivered from an integrated database. Using a web browser, the user interface can be accessed internally and secured externally for annotation and reviewing. From a user's perspective, the diverse analytic tools and algorithms, described below, appear as a single unified application. Users are able to edit the data and add annotations to the gene sequence throughout the process.

(g) OMIM-R

The OMIM-R module (FIG. 6) is a normalized relational model for housing data imported from OMIM. This module is focused around the OMIM disease table, which has an OMIM identifier as a primary key; this also serves as a weak key for integration with other modules in the OMICIADB schema. The cryptically named columns are named after multi-valued OMIM fields of the same name. Each disease can have multiple phenotypes attached. Each disease can also have multiple known mutations, and each of these can have multiple amino acid modifications.

(h) DbSNP-xml Module

This is a fairly simple module, as we opted to store each dbSNP entry as a single denormalised XML ‘blob’, accessed by a dbSNP identifier. This module is a single table, with a column for the dbSNP identifier, and the XML stored as a text field.
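A single-table store of this kind can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a self-contained stand-in for the PostgreSQL system described above. The table name, the column names, the identifier `rs0000001` and the XML layout are all hypothetical and do not reproduce the dbSNP XML format.

```python
import sqlite3
import xml.etree.ElementTree as ET


def create_store():
    """Create an in-memory single-table store: identifier plus XML text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dbsnp_xml (dbsnp_id TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    return conn


def store_entry(conn, dbsnp_id, xml_text):
    """Insert or overwrite the XML 'blob' for one identifier."""
    conn.execute(
        "INSERT OR REPLACE INTO dbsnp_xml (dbsnp_id, xml) VALUES (?, ?)",
        (dbsnp_id, xml_text),
    )


def fetch_alleles(conn, dbsnp_id):
    """Look up one entry by identifier and parse alleles out of its XML."""
    row = conn.execute(
        "SELECT xml FROM dbsnp_xml WHERE dbsnp_id = ?", (dbsnp_id,)
    ).fetchone()
    if row is None:
        return []
    return [element.text for element in ET.fromstring(row[0]).iter("allele")]


conn = create_store()
store_entry(conn, "rs0000001",
            "<snp id='rs0000001'><allele>A</allele><allele>G</allele></snp>")
alleles = fetch_alleles(conn, "rs0000001")
```

The trade-off of the denormalised design is visible here: loading is trivial, but any query on the content of an entry requires parsing its XML after retrieval.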

(i) Dataflow and Software Architecture for Database Loading

OMIM records are parsed into XML and then loaded into the OMIM-R module; any mutational information from the OMIM allele entries is mapped from protein coordinate space onto genomic coordinates from Genbank, and the resulting features are stored in the BioSQL module. The NCBI human genome build and RefSeq database are loaded into the BioSQL module. dbSNP is loaded into the dbSNP-xml module. Any disease gene information from the literature curation process is parsed into XML and then stored in the DiseaseGene module.

After any of these steps, which can be carried out independently, all genes in the DiseaseGene module are iterated through. For every gene we look for the corresponding gene annotation in BioSQL and find nearby marker features. We instantiate marker entries for all of these markers within a gene in the so-named OMICIASNP module. We also create a so-named OMICIAGeneSummary XML file, which integrates information from all modules into one XML document. The Annotation Station web interface code accesses this cached data for building the web displays. This is achieved through two Application Programming Interfaces (APIs): the open source BioSQL PERL API and a proprietary PERL API. The proprietary PERL API developed by Omicia (so-named “OMICIA API”) provides a unified access layer for all so-named OMICIADB modules and a simplified layer on top of the BioSQL API. However, the skilled artisan is able to construct his or her own API.

(j) Deterministic Algorithm for OMIM Allele Mutation Mapping

Many OMIM alleles could not be mapped directly, usually because some of the literature references were based on outdated versions of the reference sequence, or actually predated the availability of any reference sequence. When less than 50% of OMIM alleles for a gene could not be mapped, a simple codon position shift/deterministic algorithm was applied to get a higher percentage of mapped OMIM alleles. The algorithm is as follows: For a given gene, all OMIM allele codon positions are ordered and stored in a hash. Every neighboring OMIM protein allele is compared to the original protein sequence from Genbank. If any of these pairs match with the Genbank annotation, all other OMIM protein alleles are “shifted” according to the offset into the Genbank annotation as well. By using this deterministic algorithm, an additional 12.11% of OMIM alleles were mapped (see Table 4).
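One possible reading of the codon position shift described above is sketched below. It is an interpretation under stated assumptions and not the prototype's PERL implementation: each allele is reduced to its wild-type residue and its published codon position, a pair of neighboring alleles is accepted only if both wild-type residues agree with the protein sequence at the same offset, and the search window `max_shift` is an assumed parameter. The sequence and alleles are invented for illustration.

```python
def residue_at(protein, position):
    """Return the residue at a 1-based position, or None if out of range."""
    if 1 <= position <= len(protein):
        return protein[position - 1]
    return None


def find_shift(alleles, protein, max_shift=200):
    """Find an offset at which a neighboring pair of alleles both match.

    alleles -- list of (wild_type_residue, published_codon_position)
    Smaller offsets are tried first; None is returned if no pair matches.
    """
    ordered = sorted(alleles, key=lambda allele: allele[1])
    candidate_shifts = sorted(range(-max_shift, max_shift + 1), key=abs)
    for (ref1, pos1), (ref2, pos2) in zip(ordered, ordered[1:]):
        for shift in candidate_shifts:
            if (residue_at(protein, pos1 + shift) == ref1
                    and residue_at(protein, pos2 + shift) == ref2):
                return shift
    return None


def map_alleles(alleles, protein, max_shift=200):
    """Apply the common shift to every allele; unmatched alleles map to None."""
    shift = find_shift(alleles, protein, max_shift)
    mapped = {}
    for ref, pos in alleles:
        if shift is not None and residue_at(protein, pos + shift) == ref:
            mapped[(ref, pos)] = pos + shift
        else:
            mapped[(ref, pos)] = None
    return mapped


# Invented example: published positions are 5 lower than current ones.
protein = "MKVLAAGIRSTPWYHDEQNCFG"
published = [("R", 4), ("W", 8), ("C", 15)]
mapping = map_alleles(published, protein)
```

In this invented case the pair (R, 4) and (W, 8) first agrees with the sequence at an offset of 5, so all three alleles are moved to positions 9, 13 and 20. Requiring two neighboring alleles to agree, rather than one, is what limits chance matches.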

In order to evaluate the software system, 97 genes were selected for annotation. The annotation consisted of reviewing the clinical associations for each marker within a gene, assessing it for significance, and then updating the database with a clinical reference and rationale for selection. Biologists supported by computer scientists performed this annotation process. The process was assessed and modified throughout the effort as bottlenecks were identified. More expeditious approaches were developed on the basis of overcoming these obstacles. Table 4 shows a summary of the selected 97 genes. OMIM had annotated a total of 529 alleles for these sequences. This corresponds on average to 5.45 allele annotations per gene. This is significantly higher than for our total set of disease genes, which had on average 3.25 alleles per gene (see Table 3). We believe this bias is due to the selection process, by which we mostly used well-known and therefore well-studied genes, for which many clinical studies have been performed over the years. Typical genes in our set include APOE, CFTR, APC, MTTR and all genes starting with the letter “R”. Even with this result, about 30.60% of the OMIM alleles remain unmapped. More sophisticated algorithms can be developed to map a higher percentage of the remaining OMIM alleles.

We found that, on average, the annotation of each gene took approximately 2 hours. This time can be reduced through the use of text mining tools and through the development and refinement of a ranking scheme for the markers, because the integration of the clinical and pharmaceutical information, which is of great value to the individual client, is the most burdensome step.

On average, 4.24 markers per gene were selected. The criteria for marker selection were based on the association significance of a marker with a disease. Markers were selected that were believed to be informative and useful for a typical individual. Based on this initial study, it is estimated that a set of approximately 7,344 (1,732 total disease genes × 4.24) informative disease markers is supported by the current clinical literature.
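The arithmetic behind this estimate can be reproduced from the counts reported in Tables 3 and 4. The short sketch below is illustrative only; the variable and function names are hypothetical, and rounding the per-gene rate to two decimals before extrapolating is an assumption chosen to match the figures quoted in the text.

```python
# Counts taken from Table 3 (all disease genes) and Table 4 (annotated test set).
TOTAL_DISEASE_GENES = 1732
TOTAL_OMIM_ALLELES = 5624
ANNOTATED_GENES = 97
ANNOTATED_OMIM_ALLELES = 529
MARKERS_PICKED = 411


def per_gene(count, genes):
    """Average count per gene, rounded to two decimals as in the tables."""
    return round(count / genes, 2)


markers_per_gene = per_gene(MARKERS_PICKED, ANNOTATED_GENES)
alleles_per_gene_test_set = per_gene(ANNOTATED_OMIM_ALLELES, ANNOTATED_GENES)
alleles_per_gene_all = per_gene(TOTAL_OMIM_ALLELES, TOTAL_DISEASE_GENES)

# Extrapolate the test-set selection rate to all disease genes.
estimated_markers = round(TOTAL_DISEASE_GENES * markers_per_gene)

print(markers_per_gene, estimated_markers)
```

The estimate assumes that the selection rate observed for the 97 well-studied genes carries over to the remaining disease genes, which the text itself flags as a biased sample.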

(k) a “High-Level” Genetic Disease Ontology

The genetic associations were placed into high-level categories, based upon a leading medical text, Harrison's Principles of Internal Medicine. The categories used were: Cancer; Cardiovascular; Endocrinology and metabolism; Gastrointestinal system; Hematology/blood disorders; Kidney and urinary tract; Immune system; Neurological and psychiatric disorders; Respiratory; and Others. This effort made clear that a uniform terminology does not exist within the medical literature. Therefore, the associations are preferably categorized using the MeSH disease ontology (see Example 5 below).

TABLE 1 A typical disease gene research abstract for the RANTES gene (published in American Journal of Respiratory and Critical Care Medicine citation).

Original Article

A Functional Polymorphism in the RANTES Gene Promoter Is Associated with the Development of Late-Onset Asthma

Nobuyuki Hizawa, Etsuro Yamaguchi, Satoshi Konno, Yoko Tanino, Eisei Jinushi and Masaharu Nishimura

First Department of Medicine, Hokkaido University School of Medicine, Sapporo, Japan

Correspondence: Correspondence and requests for reprints should be addressed to Nobuyuki Hizawa, First Department of Medicine, Hokkaido University School of Medicine, Kita-Ku, N-15 N-7, Sapporo, Japan 060-8638. E-mail: nhizawa@med.hokudai.ac.jp

The CC chemokine regulated upon activation, normal T-cell expressed and secreted (RANTES) attracts eosinophils, basophils, and T cells during inflammation and immune response, indicating a possible role for this chemokine in asthma. Both the −403A and −28G alleles of the RANTES promoter region exhibit significantly enhanced promoter activity in reporter constructs in vitro. We therefore investigated the genetic influence of these alleles on the development of asthma using a case-control analysis in a Japanese population (298 patients with asthma and 311 control subjects). Given the evidence for heterogeneity of asthma according to age at onset, we divided patients with asthma into three subgroups: 117 late-onset patients with asthma (onset at more than 40 years of age), 83 middle-onset patients with asthma (onset at 20 to 40 years of age), and 98 early-onset patients with asthma (onset at less than 20 years of age). The −28G allele was significantly associated with late-onset asthma (odds ratio = 2.033; 95% confidence interval, 1.379-2.998; corrected p < 0.0025) but was not associated with the other two asthma subgroups. The −403A allele was not associated with any of the asthma subgroups. Further evidence of the importance of the −28G allele was a significant increase in the production of RANTES in vitro in individuals who carried this allele. Our findings suggest that, among Japanese, the −28G allele of the RANTES promoter region confers susceptibility to late-onset asthma.

Key Words: late-onset asthma • RANTES • single nucleotide polymorphism (SNP)

TABLE 2 OMIM entry 187011 for the RANTES gene.

*187011 CHEMOKINE, CC MOTIF, LIGAND 5; CCL5

Alternative titles; symbols
SMALL INDUCIBLE CYTOKINE A5, FORMERLY; SCYA5, FORMERLY
REGULATED UPON ACTIVATION, NORMALLY T-EXPRESSED, AND PRESUMABLY SECRETED; RANTES
T CELL-SPECIFIC RANTES
T CELL-SPECIFIC PROTEIN p228; TCP228

Gene map locus 17q11.2-q12

CLONING

Using a human cDNA library that was enriched by subtractive hybridization for sequences expressed by T lymphocytes but not B lymphocytes, Schall et al. (1988) isolated a gene (D17S136E), which they designated RANTES, that encodes a novel T cell-specific molecule. (RANTES is an acronym for ‘Regulated upon Activation, Normally T-Expressed, and presumably Secreted.’) The gene product was predicted to be a 10-kD protein which, after cleavage of the signal peptide, could be expected to be approximately 8 kD. Of the 68 residues, 4 are cysteines, and there are no sites for N-linked glycosylation. Significant homology (30 to 70%) was found between the RANTES sequence and several other T-cell genes, suggesting that they constitute a family of small, secreted T-cell molecules. Schall et al. (1988) found that RANTES, also designated p228 (TCP228), was expressed in 10 functional T-cell lines, but not in 8 hematopoietic tumor lines or in 6 T-cell tumor lines. Its expression was increased more than 10-fold in peripheral blood lymphocytes 3 to 5 days following mitogenic or antigenic stimulation.

GENE FUNCTION

CD8-positive T lymphocytes are involved in the control of human immunodeficiency virus (HIV) infection in vivo. Cocchi et al. (1995) demonstrated that the chemokines RANTES, MIP-1-alpha (182283), and MIP-1-beta (182284) are the major HIV-suppressive factors produced by CD8-positive T cells. HIV-suppressive factor activity produced by either immortalized or primary CD8-positive T cells was completely blocked by a combination of neutralizing antibodies against these 3 cytokines. On the other hand, recombinant forms of the 3 human cytokines induced a dose-dependent inhibition of different strains of HIV-1, HIV-2, and simian immunodeficiency virus (SIV). Cocchi et al. (1995) speculated that chemokine-mediated control of HIV may occur either directly, through their inherent anti-lentiretroviral activity, or indirectly, through their ability to chemoattract T cells and monocytes to the proximity of the infection foci. However, this latter mechanism may also have the opposite effect of providing new, uninfected targets for HIV infection. The authors noted that the findings may be relevant for the prevention and therapy of AIDS.

Arenzana-Seisdedos et al. (1996) investigated a derivative of RANTES as a possible therapeutic agent for inhibition of HIV infection. The derivative, called RANTES (9-68), lacks the first 8 N-terminal amino acids and has no chemotactic or leukocyte-activating properties. RANTES (9-68) was a potent receptor antagonist and inhibited infection of macrophage-tropic HIV. The anti-HIV activity was somewhat lower than that of RANTES itself, which correlated with its lower affinity for CC chemokine receptors. Arenzana-Seisdedos et al. (1996) found that the anti-HIV activity of RANTES and RANTES (9-68) showed some variability depending on the donor cells. The authors concluded that structural modification of a chemokine can yield variants lacking activation properties but retaining both high-affinity for chemokine receptors and the ability to block HIV infection.

Pritts et al. (2002) investigated the effect of PPAR-gamma ligands upon transcription and translation of RANTES in human endometrial stromal cells. Three putative PPAR response elements (PPREs) were found in the human RANTES promoter. In cells transfected both with RANTES promoter vectors containing 958 bp and 3 PPREs, the addition of 2 PPAR-gamma ligands inhibited promoter activity by 60% (P less than 0.01) and 48% (P less than 0.02), respectively. Truncation of the gene promoter to delete all putative PPREs abrogated the ligand-induced inhibition. Stromal cells showed a 40% decrease in RANTES protein secretion when treated with a PPAR-gamma ligand (P less than 0.01). The authors concluded that use of PPAR-gamma ligands to reduce chemokine production and inflammation may be a productive strategy for future therapy of endometrial disorders, such as endometriosis.

MAPPING

By analysis of somatic cell hybrids and by in situ hybridization using the cDNA probe, Donlon et al. (1990) assigned the RANTES locus to 17q11.2-q12. A secondary hybridization peak was noted in the region 5q31-q34, which may represent the location of other members of the gene family. The region on chromosome 5 overlaps with the location of an extended linked cluster of growth factor and receptor genes, some of which may be coregulated with members of the RANTES gene family.

MOLECULAR GENETICS

RANTES is one of the natural ligands for the chemokine receptor CCR5 (CMKBR5; 601373) and potently suppresses in vitro replication of the R5 strains of HIV-1, which use CCR5 as a coreceptor. Previous studies showing that peripheral blood mononuclear cells or CD4+ lymphocytes obtained from different individuals have wide variations in their ability to secrete RANTES prompted Liu et al. (1999) to analyze the upstream noncoding region of the RANTES gene, which contains cis-acting elements involved in RANTES promoter activity, in 272 HIV-1-infected and 193 non-HIV-1-infected individuals in Japan. They found 2 polymorphic positions, 1 of which was associated with reduced CD4+ lymphocyte depletion rates during untreated periods in HIV-1-infected individuals. This −28G mutation of the RANTES gene (187011.0001) occurred at an allele frequency of approximately 17% in the non-HIV-1-infected Japanese population and exerted no influence on the incidence of HIV-1 infection. Functional analyses of RANTES promoter activity indicated that the −28G mutation increases transcription of the RANTES gene. Taken together, these data suggested that the −28G mutation increases RANTES expression in HIV-1-infected individuals and thus delays the progression of the HIV-1 disease.

ALLELIC VARIANTS (selected examples)

.0001 HUMAN IMMUNODEFICIENCY VIRUS TYPE 1, DELAYED DISEASE PROGRESSION WITH INFECTION BY [SCYA5, −28C-G]

In a large Japanese cohort of HIV-1-infected and non-HIV-1-infected individuals, Liu et al. (1999) identified a C-to-G transversion at position −28 in the promoter of the SCYA5 gene, also referred to as the RANTES gene. The −28G allele had a frequency of approximately 17% in the Japanese population and appeared to have no influence on the incidence of HIV-1 infection. However, functional analyses indicated that the −28G mutation increased transcription of the RANTES gene. Liu et al. (1999) suggested that the −28G mutation increases RANTES expression in HIV-1-infected individuals and thus delays the progression of the HIV-1 disease. They showed that the −28G mutation is associated with reduced rates of depletion of CD4+ lymphocytes in HIV-1-infected individuals, thus confirming that this polymorphism delays HIV-1 disease progression.

.0002 HUMAN IMMUNODEFICIENCY VIRUS TYPE 1, RAPID DISEASE PROGRESSION WITH INFECTION BY [SCYA5, 168923, T/C]

Among 7 SNPs within the RANTES gene investigated by An et al. (2002), one was the intronic RANTES regulatory element, In1.1T/C (168923T/C). They found that In1.1C-bearing genotypes accounted for 37% of the attributable risk for rapid progression to AIDS among African Americans. Because 36% of African Americans carry the In1.1C allele, it is likely that In1.1C may have a significant impact on the AIDS epidemic in sub-Saharan Africa.

REFERENCES

-   1. An, P.; Nelson, G. W.; Wang, L.; Donfield, S.; Goedert, J. J.; Phair, J.; Vlahov, D.; Buchbinder, S.; Farrar, W. L.; Modi, W.; O'Brien, S. J.; Winkler, C. A.: Modulating influence on HIV/AIDS by interacting RANTES gene variants. Proc. Nat. Acad. Sci. 99: 1002-1007, 2002.

PubMed ID: 11792860

-   2. Arenzana-Seisdedos, F.; Virelizier, J.-L.; Rousset, D.; Clark-Lewis, I.; Loetscher, P.; Moser, B.; Baggiolini, M.: HIV blocked by chemokine antagonist. (Letter) Nature 383: 400 only, 1996.

PubMed ID: 8837769

-   3. Bakhiet, M.; Tjernlund, A.; Mousa, A.; Gad, A.; Stromblad, S.; Kuziel, W. A.; Seiger, A.; Andersson, J.: RANTES promotes growth and survival of human first-trimester forebrain astrocytes. Nature Cell Biol. 3: 150-157, 2001.

TABLE 3 Statistical summary on the initial OMIM mapping efforts.

                                                                      Disease genes  Percentage
Total “disease genes”                                                 1,732
Total “disease genes” mapped to build 30                              1,614          93.19%
Non-uniquely mapped “disease genes”                                   118            6.81%
Total OMIM alleles                                                    5,624
OMIM alleles per gene                                                 3.25
OMIM alleles mapped to build30                                        3,222          57.29%
OMIM alleles mapped to build30 using deterministic algorithm          681            12.11%
Total OMIM alleles mapped                                             3,903          69.40%
OMIM alleles that are in dbSNP (incl. deterministic shift algorithm)  105            1.87%
OMIM alleles that are in HGMD (incl. deterministic shift algorithm)   2,276          40.47%
Total OMIM alleles either in dbSNP or HGMD                            2,381          42.34%

TABLE 4 Annotated test set of 97 disease genes.

                                              Annotated Genes
Total # of genes                              97               5.60%
Total OMIM alleles                            529
OMIM alleles per gene                         5.45
Total OMIM alleles either in dbSNP or HGMD    432              81.66%
Total markers picked for annotated genes      411
Markers picked per gene                       4.24

Example 5 Ontologies for Marker Selection

In one embodiment of selecting one or more of the markers to be included in a preselected set of markers, in accordance with the present invention, an example of applicable techniques follows. Briefly, ‘porting’ the MeSH ontology to GO format facilitates the use of various software applications that have been developed for manual curation, editing and maintenance of the Gene Ontology (GO), allowing the rapid addition of human gene and disease information to MeSH and, where necessary, allowing the alteration of the MeSH ontology itself so as to render it a more suitable container for this information. A further integration of the GO terminology and refinement of the disease ontology for genetic disorders can facilitate the preselection of the set of markers.

We outline a methodology for doing so below, and also propose a text-mining approach (Yandell M D, Majoros W H, Genomics and natural language processing, Nat Rev Genet 3: 601-10 [2002]) that makes use of the HGMD database to jump-start the entire effort by rapidly assigning the 1,037 human disease-causing genes contained in HGMD to MeSH via an automatic procedure. Preliminary studies indicate that assigning the human disease genes to MeSH can bring to light many as yet unappreciated relationships between sequence homology and disease. FIG. 7 illustrates that in several cases genes involved in hand deformities are also involved in foot deformities, and that in some cases these genes are homologous to one another, e.g., PAX3, HOXA13, and MSX1 (all are homeodomain proteins); whereas FOXC2 and FOXE1 are paralogous genes of the Fork head class. In other cases, genes involved in related diseases are involved in interacting signaling pathways and developmental processes, e.g., FGFR2 and GDF5 (for reviews see Blundell T L et al., Protein protein interactions in receptor activation and intracellular signalling, Biol Chem 381: 955-9 [2000]; Buxton P et al., Growth/differentiation factor-5 (GDF-5) and skeletal development, J Bone Joint Surg Am 83-A: S23-30 [2001]; Thiery J P, Role of growth factor signaling in epithelial cell plasticity during development and in carcinogenesis, Bull Acad Natl Med 185: 1279-92 [2001]). By systematically assigning all known human disease genes to MeSH, previously unnoticed relationships between disease-causing genes can be brought to light. In many cases, genes homologous to a known disease gene, but not yet implicated in that disease, may prove to be involved in causing those same diseases. Tying genes to MeSH provides an important resource allowing hypothesis generation for human disease research. The identification of statistically significant associations between SNPs, haplotypes, and complex diseases can be significantly simplified by first assigning genes to MeSH.

In accordance with the invention, the assignment of genes to MeSH provides a logical starting point by which to choose sets of genetic markers already implicated in a similar clinical phenotype, thereby revealing significant correlations between these gene marker sets and (related) disease phenotypes using SNP and haplotype data, and thereby circumventing much of the statistical noise associated with large scale studies that attempt to identify groups of genes (without prior knowledge) using polymorphism data for all, or a randomly chosen subset, of genes. Placing genes in MeSH also provides a means by which to cross-validate more traditional approaches to SNP and haplotype analysis.

The core database system consists of four major database components:

-   1. Disease marker database including genomic sequence and variation database
-   2. Individualized genome-profile mutation database
-   3. Clinical disease information database
-   4. Genetic profile report database

The disease marker database holds a large number of unstructured data points, ranging from genome sequence strings to disease ontologies such as MeSH (Schulman J-L, What's New for 2001 MeSH, NLM Tech Bull. 317 [2001]). While loading these databases from public data repositories, various data consistency checks need to be implemented because the data stored in these repositories are unreliable and error-prone. Further attention has to be paid to the fact that there is much redundancy in gene databases, which is mostly due to the unstructured gene nomenclature, only recently subject to efforts at unification.

For individualized genome-profiles, the genetic data and the individual's geographical data are stored physically separately because of extremely important privacy concerns.
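One way to picture this separation is a pair of stores that share only an opaque, randomly generated key. The sketch below is a hypothetical illustration, not the disclosed system: the class names, the fields, and the use of in-memory dictionaries in place of physically separate databases are all assumptions.

```python
import secrets


class IdentityStore:
    """Holds personal and geographical data, keyed by an opaque token."""

    def __init__(self):
        self._records = {}

    def register(self, name, region):
        token = secrets.token_hex(16)  # random key with no personal meaning
        self._records[token] = {"name": name, "region": region}
        return token

    def lookup(self, token):
        return self._records[token]


class GenotypeStore:
    """Holds genetic data only; it never sees names or locations."""

    def __init__(self):
        self._profiles = {}

    def save(self, token, genotypes):
        self._profiles[token] = dict(genotypes)

    def load(self, token):
        return self._profiles[token]


# Hypothetical usage: the two stores would live on separate systems.
identities = IdentityStore()
profiles = GenotypeStore()
token = identities.register("Ms. Smith", "Region A")
profiles.save(token, {"RANTES -28": "C/G"})
```

A report generator would need access to both stores and the token to reassemble a profile, so a breach of the genotype store alone would not reveal whose data it holds.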

The genetic profile report database is a virtual join over many entries from all the databases above. Nevertheless, any updates made by hand-editors such as genetic counselors should propagate back to the origin of the information storage. This versioning system is important to reduce redundant hand-editing.

The relationship between genes (including their DNA sequences, mRNA sequences and protein sequences) and clinical disease phenotypes is annotated by automatic text mining algorithms that search and pre-filter literature databases such as MEDLINE, employing a computer interface for disease marker phenotype annotation using disease ontology and clinical study abstracts (see, e.g., Example 4 above).

Example 6 Sample of Organizational Matrix for Communicating the Genetic Information to the Individual

Existing disease ontologies such as MeSH (Schulman J-L, What's New for 2001 MeSH, NLM Tech Bull. 317 [2001]) are very technical and therefore not generally suited for communication to individuals. The following example (Table 5) illustrates one embodiment of an organizational matrix for communicating genetic information to the individual in accordance with the invention. Specific focus is given to the most significant markers, the most informative markers, markers specifically relevant for the gender, markers related to aging, and markers that have a well-described pharmacogenomics application. Here, the individual is a hypothetical “Ms. Smith,” to whom the terminology and structure of the report are explained in overview, and then detailed information is provided. Examples of other sections that are not shown in Table 5 but which can be included are a glossary, a list of references for further information, a health risk appraisal, and an index. A “Rare Genetic Conditions or Carrier Report” at the end of the sample report lists, in alphabetical order, a panoply of genetic conditions for which testing was done, but is presented here in truncated form for the sake of brevity.

1-30. (canceled)
31. A computer-implemented method for providing an evaluation indicative of a probability that an individual will exhibit one or more phenotypic attributes, comprising: (a) evaluating markers of the individual that are preselected based on association or other studies to be linked with the one or more phenotypic attributes, wherein the markers map to at least 1,000 sites; (b) evaluating a pattern of the markers to determine a genomic ethnicity of the individual; (c) determining a probability that the individual will exhibit the one or more phenotypic attributes based at least in part on the pattern of the markers and the genomic ethnicity of the individual; and (d) generating a report with an evaluation indicative of the probability that the individual will exhibit the one or more phenotypic attributes.
32. The method according to claim 31, wherein the genomic ethnicity is determined with respect to haplotype blocks.
33. The method according to claim 32, wherein the haplotype blocks are determined based on a presence of the markers.
34. The method according to claim 33, further comprising evaluating common variants associated with complex diseases, wherein the common variants are present within the haplotype blocks identified by the markers.
35. The method according to claim 31, further comprising determining a likely applicability of clinical research results based on the genomic ethnicity of the individual.
36. The method according to claim 31, wherein the markers comprise a plurality of exon/intron junction sequences.
37. The method according to claim 36, wherein at least about 20% of the markers in the preselected set are exon/intron junction sequences.
38. The method according to claim 31, wherein the markers comprise a plurality of promoter sequences.
39. The method according to claim 38, wherein at least about 20% of the markers are promoter sequences.
40. The method according to claim 31, further comprising formatting the report according to an organizational matrix that determines a grouping and presentation of information on the report.
41. The method according to claim 40, wherein the organizational matrix (i) groups together phenotypic attributes for which the individual has an enhanced probability, or (ii) groups together phenotypic attributes related to similar physiological systems.
42. The method according to claim 40, wherein the organizational matrix (i) ranks phenotypic attributes as a function of potential impact on a lifestyle or quality of life of the individual, or (ii) ranks phenotypic attributes as a function of the genomic ethnicity of the individual.
43. The method according to claim 31, wherein in the report, an identity of the individual is not associated with data corresponding to any genotypic characteristics or the probability that the individual will exhibit the one or more phenotypic attributes.
44. The method according to claim 31, wherein the markers map to at least about 1,000 discrete loci.
45. The method according to claim 31, further comprising placing associations between the one or more phenotypic attributes and the markers on a two-dimensional grid for display on the report.
46. The method according to claim 31, further comprising performing a targeted assay to identify the markers, which assay includes nucleic acid amplification of a biological sample from the individual.
47. The method according to claim 31, further comprising providing the report for display on a graphical user interface.
48. The method according to claim 31, wherein in (a), the markers are preselected based on association or other studies to be linked with at least 10 phenotypic attributes, and wherein the markers map to at least 1,000 sites.
49. The method according to claim 31, further comprising, subsequent to (b), obtaining a marker score by comparing the markers and the genomic ethnicity of the individual to a multivariate scoring matrix that relates patterns of markers and genomic ethnicity with probabilities of exhibiting phenotypic attributes, which phenotypic attributes include the one or more phenotypic attributes.
50. The method according to claim 49, wherein the multivariate scoring matrix scores the markers at least with respect to three or more criteria selected from the group consisting of penetrance of a given marker in a population of interest, ontological classification, conservation of mutated sequence sites at conserved or less conserved sites, and biological significance.
51. The method according to claim 49, wherein the multivariate scoring matrix comprises a combination of one or more scoring matrix vectors selected from the group consisting of a descriptor of family history, a descriptor of general medical physiological values, a descriptor of mRNA expression levels, a descriptor of methylation profiles, a descriptor of protein expression levels, a descriptor of enzyme activity, and a descriptor of antibody load.
52. The method according to claim 49, further comprising using the marker score to determine the probability that the individual will exhibit the one or more phenotypic attributes.